
# Overview

1. Preprocessing

Clean the text: remove unnecessary characters.

2. Feature Extraction

Tokenization: split the text into individual words or tokens.
Create an annotated dataset.

3. Building the Model

Train the model on annotated resumes where entities like name, job role, etc., are labeled.

4. Model Evaluation

5. Saving and Deployment

Save the trained model using a library like joblib or pickle.

Deploy the model using Streamlit for an interactive web application.

# Loading Libraries

In [1]:
import pandas as pd
import numpy as np

In [None]:
#!pip install spacy transformers pdfplumber PyMuPDF torch joblib pickle5



# Load Dataset

In [2]:
jobs_df = pd.read_csv('/content/jobs_sampled.csv')
jobs_df.head()

Unnamed: 0,Name,Title,Role,Contact,Qualifications,Experience,Skills,Company
0,Christopher Duffy,Back-End Developer,API Developer,922.551.4444,MBA,3 to 14 Years,API design and development RESTful API knowled...,State Farm Insurance
1,Stephanie Morris,Back-End Developer,API Developer,806.716.2250x944,BA,5 to 14 Years,API design and development RESTful API knowled...,Capital One Financial
2,Anthony Taylor,Back-End Developer,API Developer,(953)310-0075x7268,B.Com,4 to 13 Years,API design and development RESTful API knowled...,Cummins
3,Jacqueline Anderson,Back-End Developer,API Developer,+1-923-200-8008,MBA,5 to 8 Years,API design and development RESTful API knowled...,Eastman Chemical
4,Angela Hall,Back-End Developer,API Developer,(246)327-9483,M.Com,3 to 15 Years,API design and development RESTful API knowled...,Analog Devices


In [3]:
# Step 1: Extract unique roles
unique_roles = jobs_df['Role'].unique()

# Step 2: Create a cyclic iterator for the unique roles
from itertools import cycle
role_cycle = cycle(unique_roles)

# Step 3: Assign roles to each row in an alternating manner
jobs_df['Role'] = [next(role_cycle) for _ in range(len(jobs_df))]

# Display the updated DataFrame
jobs_df.head()

Unnamed: 0,Name,Title,Role,Contact,Qualifications,Experience,Skills,Company
0,Christopher Duffy,Back-End Developer,API Developer,922.551.4444,MBA,3 to 14 Years,API design and development RESTful API knowled...,State Farm Insurance
1,Stephanie Morris,Back-End Developer,Accessibility Developer,806.716.2250x944,BA,5 to 14 Years,API design and development RESTful API knowled...,Capital One Financial
2,Anthony Taylor,Back-End Developer,Account Executive,(953)310-0075x7268,B.Com,4 to 13 Years,API design and development RESTful API knowled...,Cummins
3,Jacqueline Anderson,Back-End Developer,Account Manager,+1-923-200-8008,MBA,5 to 8 Years,API design and development RESTful API knowled...,Eastman Chemical
4,Angela Hall,Back-End Developer,Account Strategist,(246)327-9483,M.Com,3 to 15 Years,API design and development RESTful API knowled...,Analog Devices
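
To make the effect of the cyclic assignment concrete, here is a small sketch on toy data. The role names and the row count are made up for illustration; only `itertools.cycle` and the list-comprehension pattern come from the cell above.

```python
from itertools import cycle

# Hypothetical toy data: three unique roles, seven rows
unique_roles = ["API Developer", "Account Manager", "Data Analyst"]
n_rows = 7

# cycle() repeats the roles endlessly; next() takes one value per row
role_cycle = cycle(unique_roles)
assigned_roles = [next(role_cycle) for _ in range(n_rows)]

print(assigned_roles)
```

Each role appears in turn and the sequence wraps around after the last unique role. Note that this overwrites the original Role values, so a row's Role no longer necessarily matches its Title or Skills, as the table above shows.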


In [None]:
jobs_df.head(49)

Unnamed: 0,Name,Title,Role,Contact,Qualifications,Experience,Skills,Company
0,Christopher Duffy,Back-End Developer,API Developer,922.551.4444,MBA,3 to 14 Years,API design and development RESTful API knowled...,State Farm Insurance
1,Stephanie Morris,Back-End Developer,Accessibility Developer,806.716.2250x944,BA,5 to 14 Years,API design and development RESTful API knowled...,Capital One Financial
2,Anthony Taylor,Back-End Developer,Account Executive,(953)310-0075x7268,B.Com,4 to 13 Years,API design and development RESTful API knowled...,Cummins
3,Jacqueline Anderson,Back-End Developer,Account Manager,+1-923-200-8008,MBA,5 to 8 Years,API design and development RESTful API knowled...,Eastman Chemical
4,Angela Hall,Back-End Developer,Account Strategist,(246)327-9483,M.Com,3 to 15 Years,API design and development RESTful API knowled...,Analog Devices
5,Olivia Oneill,Back-End Developer,Accounting Controller,001-971-792-2221x4550,BBA,2 to 12 Years,API design and development RESTful API knowled...,International Business Machines
6,Christopher Campbell,Back-End Developer,Accounting Manager,+1-975-518-5700x16656,BCA,1 to 11 Years,API design and development RESTful API knowled...,Newell Brands
7,Edward Griffin,Back-End Developer,Acute Care Nurse Practitioner,(883)210-3252x0822,MBA,0 to 15 Years,API design and development RESTful API knowled...,United Natural Foods
8,Samantha Henson,Back-End Developer,Addiction Counselor,877.297.2775x2285,PhD,0 to 10 Years,API design and development RESTful API knowled...,Knight-Swift Transportation Holdings
9,Jesse Snyder,Back-End Developer,Administrative Assistant,919.573.2212x880,BBA,0 to 10 Years,API design and development RESTful API knowled...,Avis Budget Group


In [4]:
df=jobs_df.copy()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37600 entries, 0 to 37599
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Name            37600 non-null  object
 1   Title           37600 non-null  object
 2   Role            37600 non-null  object
 3   Contact         37600 non-null  object
 4   Qualifications  37600 non-null  object
 5   Experience      37600 non-null  object
 6   Skills          37600 non-null  object
 7   Company         37600 non-null  object
dtypes: object(8)
memory usage: 2.3+ MB


In [6]:
df.columns

Index(['Name', 'Title', 'Role', 'Contact', 'Qualifications', 'Experience',
       'Skills', 'Company'],
      dtype='object')

In [5]:
# Keep only Role and Skills: drop every other column
columns_to_drop = ['Name', 'Title', 'Contact', 'Qualifications', 'Experience', 'Company']
df = df.drop(columns=columns_to_drop)

# Display the updated DataFrame
df.head()

Unnamed: 0,Role,Skills
0,API Developer,API design and development RESTful API knowled...
1,Accessibility Developer,API design and development RESTful API knowled...
2,Account Executive,API design and development RESTful API knowled...
3,Account Manager,API design and development RESTful API knowled...
4,Account Strategist,API design and development RESTful API knowled...


In [6]:
df.columns

Index(['Role', 'Skills'], dtype='object')

In [7]:
import re

# Function to clean the Skills column
def clean_skills(skills):
    # Remove special characters, brackets, and phrases like "e.g."
    skills = re.sub(r'[()\[\]]', '', skills)  # Remove parentheses and brackets
    skills = re.sub(r'e\.g\.,?', '', skills)  # Remove "e.g." or "e.g,"
    skills = re.sub(r'[^a-zA-Z0-9, ]', '', skills)  # Keep only letters, numbers, commas, and spaces
    skills = re.sub(r'\s+', ' ', skills)  # Replace multiple spaces with a single space
    return skills.strip()  # Remove leading/trailing whitespace

# Apply the cleaning function to the Skills column
df['Skills'] = df['Skills'].apply(clean_skills)

# Display the updated DataFrame
df.head()

Unnamed: 0,Role,Skills
0,API Developer,API design and development RESTful API knowled...
1,Accessibility Developer,API design and development RESTful API knowled...
2,Account Executive,API design and development RESTful API knowled...
3,Account Manager,API design and development RESTful API knowled...
4,Account Strategist,API design and development RESTful API knowled...
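
The first rows look unchanged because they contain no special characters. To see what each cleaning step does, the sketch below applies the same function to a made-up string (the sample text is an assumption, not a row from the dataset).

```python
import re

def clean_skills(skills):
    """Same cleaning steps as in the cell above."""
    skills = re.sub(r'[()\[\]]', '', skills)        # Remove parentheses and brackets
    skills = re.sub(r'e\.g\.,?', '', skills)        # Remove "e.g." or "e.g.,"
    skills = re.sub(r'[^a-zA-Z0-9, ]', '', skills)  # Keep letters, digits, commas, spaces
    skills = re.sub(r'\s+', ' ', skills)            # Collapse repeated whitespace
    return skills.strip()

# Hypothetical input chosen to exercise every rule
sample = "API design (e.g., REST) & security [OAuth, JWT]"
cleaned = clean_skills(sample)
print(cleaned)  # API design REST security OAuth, JWT
```

Commas survive the cleaning, which is why tokens such as "OAuth," keep their trailing comma in the later word-level labels.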


# Preprocessing

## 1. Understand the Problem and Prepare the Data
Your goal is to train a DistilBERT-based NER model that can identify and classify entities in resumes into predefined categories such as Name, Title, Role, Contact, etc.

Key Challenges:

The input data is structured in a tabular format (pandas DataFrame), but NER models require token-level annotations.

You need to convert your tabular data into a format where each token in the text is labeled with its corresponding entity tag.

Solution:

You need to transform your structured data into a sequence labeling task. For example:

Input text: "John Doe worked as a Software Engineer at Google."

Expected output: ["B-Name", "I-Name", "O", "O", "O", "B-Role", "I-Role", "O", "B-Company"]
Here, B- denotes the beginning of an entity, I- denotes inside an entity, and O denotes no entity. There is one label per whitespace-separated word, so the nine words of the sentence get nine labels.

- Combined Fields into a Single Text:
We concatenated the fields into a single text column for each resume. An earlier version of the notebook combined all fields (Name, Title, Role, etc.); the current version combines only Role and Skills.
- Generated Word-Level Labels:
Using the structured data, we assigned labels (B-Role, I-Skills, etc.) to each word in the combined text.
- Tokenized the Text:
We used DistilBertTokenizerFast to tokenize the text into subword tokens.
- Aligned Labels with Tokens:
We aligned the word-level labels with the tokenized output using the tokenizer's offset_mapping output.
Subword tokens inherit the label of their parent word.
- Verified the Alignment:
By decoding the tokens and inspecting their corresponding labels, we checked the alignment on sample rows.
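
The word-level BIO labeling step can be sketched on the example sentence as follows. This is a minimal illustration, not the notebook's own function: the helper name `bio_labels` and the `entities` dictionary are assumptions, and the final period is stripped so that "Google" matches as a whole word.

```python
def bio_labels(words, entities):
    """Label each word with a BIO tag; `entities` maps a tag to its phrase."""
    labels = ["O"] * len(words)
    for tag, phrase in entities.items():
        phrase_words = phrase.split()
        n = len(phrase_words)
        # Find the first position where the whole phrase matches
        for i in range(len(words) - n + 1):
            if words[i:i + n] == phrase_words:
                labels[i] = f"B-{tag}"          # Beginning of the entity
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{tag}"      # Inside the entity
                break
    return labels

sentence = "John Doe worked as a Software Engineer at Google."
words = sentence.rstrip(".").split()  # drop the final period so "Google" matches
entities = {"Name": "John Doe", "Role": "Software Engineer", "Company": "Google"}

labels = bio_labels(words, entities)
for word, label in zip(words, labels):
    print(f"{word:10s} {label}")
```

The result has exactly one label per word, which is the invariant the later token alignment relies on.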

Create Word-level Labels Before tokenization

In [None]:
'''
# Update the create_word_labels function to also add a 'text' column
def create_word_labels(row):
    # Combine all fields into a single text
    text = f"{row['Name']} {row['Title']} {row['Role']} {row['Contact']} {row['Qualifications']} {row['Experience']} {row['Skills']} {row['Company']}"
    words = text.split()  # Split the text into words

    # Initialize labels as "O" (no entity)
    labels = ["O"] * len(words)

    # Helper function to assign labels to a specific field
    def assign_labels(field_value, entity_prefix):
        if pd.isna(field_value) or field_value.strip() == "":
            return
        field_words = field_value.split()
        start_idx = None
        for i, word in enumerate(words):
            if word == field_words[0] and words[i:i+len(field_words)] == field_words:
                start_idx = i
                break
        if start_idx is not None:
            labels[start_idx] = f"B-{entity_prefix}"  # Beginning of the entity
            for j in range(start_idx + 1, start_idx + len(field_words)):
                labels[j] = f"I-{entity_prefix}"  # Inside the entity

    # Assign labels for each field
    assign_labels(row['Name'], "Name")
    assign_labels(row['Title'], "Title")
    assign_labels(row['Role'], "Role")
    assign_labels(row['Contact'], "Contact")
    assign_labels(row['Qualifications'], "Qualifications")
    assign_labels(str(row['Experience']), "Experience")  # Convert Experience to string
    assign_labels(row['Skills'], "Skills")
    assign_labels(row['Company'], "Company")

    return text, labels  # Return both the combined text and the labels

# Apply the function and create 'text' and 'word_labels' columns
df['text'], df['word_labels'] = zip(*df.apply(create_word_labels, axis=1))
'''

In [8]:
# Build the combined 'text' and its word-level BIO labels for one row
def create_word_labels(row):
    # Combine Role and Skills into a single text
    text = f"{row['Role']} {row['Skills']}"
    words = text.split()  # Split the text into words

    # Initialize labels as "O" (no entity)
    labels = ["O"] * len(words)

    # Helper function to assign labels to a specific field
    def assign_labels(field_value, entity_prefix):
        if pd.isna(field_value) or field_value.strip() == "":
            return
        field_words = field_value.split()
        start_idx = None
        for i, word in enumerate(words):
            if word == field_words[0] and words[i:i+len(field_words)] == field_words:
                start_idx = i
                break
        if start_idx is not None:
            labels[start_idx] = f"B-{entity_prefix}"  # Beginning of the entity
            for j in range(start_idx + 1, start_idx + len(field_words)):
                labels[j] = f"I-{entity_prefix}"  # Inside the entity

    # Assign labels for each field (both columns already hold strings)
    assign_labels(row['Role'], "Role")
    assign_labels(row['Skills'], "Skills")

    return text, labels  # Return both the combined text and the labels

# Apply the function and create 'text' and 'word_labels' columns
df['text'], df['word_labels'] = zip(*df.apply(create_word_labels, axis=1))

In [9]:
# Set pandas display options to show full strings without truncation
pd.set_option('display.max_colwidth', None)  # No truncation for column width
pd.set_option('display.max_rows', None)      # Optional: Show all rows if needed

# Inspect the first row of the 'text' and 'word_labels' columns
print(df[['text', 'word_labels']].head(1))

                                                                                           text  \
0  API Developer API design and development RESTful API knowledge Security protocols OAuth, JWT   

                                                                                                                      word_labels  
0  [B-Role, I-Role, B-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills]  


In [10]:
df.head()

Unnamed: 0,Role,Skills,text,word_labels
0,API Developer,"API design and development RESTful API knowledge Security protocols OAuth, JWT","API Developer API design and development RESTful API knowledge Security protocols OAuth, JWT","[B-Role, I-Role, B-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills]"
1,Accessibility Developer,"API design and development RESTful API knowledge Security protocols OAuth, JWT","Accessibility Developer API design and development RESTful API knowledge Security protocols OAuth, JWT","[B-Role, I-Role, B-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills]"
2,Account Executive,"API design and development RESTful API knowledge Security protocols OAuth, JWT","Account Executive API design and development RESTful API knowledge Security protocols OAuth, JWT","[B-Role, I-Role, B-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills]"
3,Account Manager,"API design and development RESTful API knowledge Security protocols OAuth, JWT","Account Manager API design and development RESTful API knowledge Security protocols OAuth, JWT","[B-Role, I-Role, B-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills]"
4,Account Strategist,"API design and development RESTful API knowledge Security protocols OAuth, JWT","Account Strategist API design and development RESTful API knowledge Security protocols OAuth, JWT","[B-Role, I-Role, B-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills, I-Skills]"


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37600 entries, 0 to 37599
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Role         37600 non-null  object
 1   Skills       37600 non-null  object
 2   text         37600 non-null  object
 3   word_labels  37600 non-null  object
dtypes: object(4)
memory usage: 1.1+ MB


In [12]:
!pip install datasets



In [13]:
from transformers import DistilBertTokenizerFast
from datasets import Dataset

# Load the fast tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Define the label-to-id mapping
'''
unique_labels = ["O", "B-Name", "I-Name", "B-Title", "I-Title", "B-Role", "I-Role",
                 "B-Contact", "I-Contact", "B-Qualifications", "I-Qualifications",
                 "B-Experience", "I-Experience", "B-Skills", "I-Skills",
                 "B-Company", "I-Company"]
'''
unique_labels = ["O", "B-Role", "I-Role", "B-Skills", "I-Skills"]
label_to_id = {label: i for i, label in enumerate(unique_labels)}

# Function to tokenize and align labels
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=512,
        return_offsets_mapping=True
    )

    labels = []
    for i, offset_mapping in enumerate(tokenized_inputs['offset_mapping']):
        # Get the original text and its word-level labels
        text = examples['text'][i]
        word_labels = examples['word_labels'][i]

        # Split the text into words
        words = text.split()

        # Align labels with tokens
        label_ids = []
        current_word_idx = -1  # Tracks the current word index
        for start, end in offset_mapping:
            if start == end:  # Special tokens or padding
                label_ids.append(label_to_id["O"])
            else:
                # Find the word corresponding to the token
                # Increment the word index only when a new word starts
                if current_word_idx == -1 or start >= len(" ".join(words[:current_word_idx + 1])):
                    current_word_idx += 1

                # Assign the label of the current word to the token
                if current_word_idx < len(word_labels):
                    label_ids.append(label_to_id[word_labels[current_word_idx]])
                else:
                    label_ids.append(label_to_id["O"])  # Default to "O" if no label exists

        labels.append(label_ids)

    # Remove offset_mapping as it's not needed for training
    tokenized_inputs.pop("offset_mapping")
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
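
The function above labels special tokens and padding as "O", so with `max_length=512` and short texts most labeled positions are padding, and these positions probably count toward the training loss. A common alternative is to give such positions the label -100, which PyTorch's cross-entropy loss ignores by default. The sketch below shows only the alignment logic in plain Python. It assumes the text is tokenized as a list of words (`is_split_into_words=True`) so that the fast tokenizer's `word_ids()` gives one word index per token; the `word_ids` list here is hand-written to stand in for that call, and the helper name is hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def align_labels_with_word_ids(word_ids, word_labels, label_to_id):
    """Map word-level labels onto tokens, given one word index per token.

    `word_ids` holds, for each token, the index of the word it came from,
    or None for special tokens and padding.
    """
    label_ids = []
    for word_id in word_ids:
        if word_id is None:
            label_ids.append(IGNORE_INDEX)  # [CLS], [SEP], [PAD]
        else:
            # Subword tokens inherit the label of their parent word
            label_ids.append(label_to_id[word_labels[word_id]])
    return label_ids

label_to_id = {"O": 0, "B-Role": 1, "I-Role": 2, "B-Skills": 3, "I-Skills": 4}

# Words: API Developer RESTful API
word_labels = ["B-Role", "I-Role", "B-Skills", "I-Skills"]
# Hand-written stand-in for word_ids():
# [CLS] api developer rest ##ful api [SEP] [PAD]
word_ids = [None, 0, 1, 2, 2, 3, None, None]

aligned = align_labels_with_word_ids(word_ids, word_labels, label_to_id)
print(aligned)  # [-100, 1, 2, 3, 3, 4, -100, -100]
```

If this variant were adopted, any code that maps label IDs back to names would need to skip -100, since it has no entry in `label_to_id`.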



In [14]:
# Convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Tokenize and align labels for the entire dataset
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)


In [15]:
# Print the first example to verify
print("Tokenized Dataset Example:")
print(tokenized_dataset[0])

Tokenized Dataset Example:
{'Role': 'API Developer', 'Skills': 'API design and development RESTful API knowledge Security protocols OAuth, JWT', 'text': 'API Developer API design and development RESTful API knowledge Security protocols OAuth, JWT', 'word_labels': ['B-Role', 'I-Role', 'B-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills'], 'input_ids': [101, 17928, 9722, 17928, 2640, 1998, 2458, 2717, 3993, 17928, 3716, 3036, 16744, 1051, 4887, 2705, 1010, 1046, 26677, 102, 0, 0, 0, ... (padding IDs continue; output truncated)

In [16]:
# Decode the input_ids to see the tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(tokenized_dataset[0]['input_ids'])

# Print the decoded tokens and their corresponding labels
id_to_label = {i: label for label, i in label_to_id.items()}  # Reverse mapping: ID -> label name
for token, label_id in zip(decoded_tokens, tokenized_dataset[0]['labels']):
    print(f"Token: {token}, Label: {id_to_label[label_id]}")

Token: [CLS], Label: O
Token: api, Label: B-Role
Token: developer, Label: I-Role
Token: api, Label: B-Skills
Token: design, Label: I-Skills
Token: and, Label: I-Skills
Token: development, Label: I-Skills
Token: rest, Label: I-Skills
Token: ##ful, Label: I-Skills
Token: api, Label: I-Skills
Token: knowledge, Label: I-Skills
Token: security, Label: I-Skills
Token: protocols, Label: I-Skills
Token: o, Label: I-Skills
Token: ##au, Label: I-Skills
Token: ##th, Label: I-Skills
Token: ,, Label: I-Skills
Token: j, Label: I-Skills
Token: ##wt, Label: I-Skills
Token: [SEP], Label: O
Token: [PAD], Label: O
Token: [PAD], Label: O
Token: [PAD], Label: O
... (the remaining [PAD] tokens are all labeled O; output truncated)

In [17]:
# Function to inspect the tokenized dataset
def inspect_tokenized_output(tokenized_dataset, index):
    # Get the example at the specified index
    example = tokenized_dataset[index]

    # Extract relevant fields
    original_text = example['text']
    word_labels = example['word_labels']
    input_ids = example['input_ids']
    labels = example['labels']

    # Convert input_ids back to tokens using the tokenizer
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Print the original text
    print("Original Text:")
    print(original_text)
    print()

    # Print the word-level labels
    print("Word-Level Labels:")
    print(word_labels)
    print()

    # Print the tokenized input_ids, tokens, and aligned labels
    print("Tokenized Input IDs, Tokens, and Aligned Labels:")
    for token_id, token, label_id in zip(input_ids, tokens, labels):
        # Skip padding tokens for clarity
        if token_id == tokenizer.pad_token_id:  # [PAD] has ID 0 for this tokenizer
            continue
        print(f"Input ID: {token_id}, Token: {token}, Label: {label_id}")
    print("-" * 80)

# Example: Inspect the third example (index 2) in the tokenized dataset
inspect_tokenized_output(tokenized_dataset, index=2)

Original Text:
Account Executive API design and development RESTful API knowledge Security protocols OAuth, JWT

Word-Level Labels:
['B-Role', 'I-Role', 'B-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills', 'I-Skills']

Tokenized Input IDs, Tokens, and Aligned Labels:
Input ID: 101, Token: [CLS], Label: 0
Input ID: 4070, Token: account, Label: 1
Input ID: 3237, Token: executive, Label: 2
Input ID: 17928, Token: api, Label: 3
Input ID: 2640, Token: design, Label: 4
Input ID: 1998, Token: and, Label: 4
Input ID: 2458, Token: development, Label: 4
Input ID: 2717, Token: rest, Label: 4
Input ID: 3993, Token: ##ful, Label: 4
Input ID: 17928, Token: api, Label: 4
Input ID: 3716, Token: knowledge, Label: 4
Input ID: 3036, Token: security, Label: 4
Input ID: 16744, Token: protocols, Label: 4
Input ID: 1051, Token: o, Label: 4
Input ID: 4887, Token: ##au, Label: 4
Input ID: 2705, Token: ##th, Label: 4
Input ID: 1010, Token: ,, La

In [None]:
'''
id_to_label = {
    0: "O",
    1: "B-Name",
    2: "I-Name",
    3: "B-Title",
    4: "I-Title",
    5: "B-Role",
    6: "I-Role",
    7: "B-Contact",
    8: "I-Contact",
    9: "B-Qualifications",
    10: "I-Qualifications",
    11: "B-Experience",
    12: "I-Experience",
    13: "B-Skills",
    14: "I-Skills",
    15: "B-Company",
    16: "I-Company"
}
'''

In [18]:
# Reverse mapping
id_to_label = {v: k for k, v in label_to_id.items()}
print(id_to_label[4])  # Output: "I-Skills"

I-Skills


## Train-Test Split the Dataset

- Split the dataset into training, validation, and test sets. This ensures that we have separate datasets for training, tuning, and evaluating the model.


In [19]:
# Split the dataset into train and test sets
train_test_split = tokenized_dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Optionally, split the training set further into train and validation sets
train_val_split = train_dataset.train_test_split(test_size=0.1)
train_dataset = train_val_split['train']
val_dataset = train_val_split['test']

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"Test examples: {len(test_dataset)}")

Training examples: 27072
Validation examples: 3008
Test examples: 7520


In [20]:
from datasets import DatasetDict

# Save datasets to disk
train_dataset.save_to_disk("train_dataset")
val_dataset.save_to_disk("val_dataset")
test_dataset.save_to_disk("test_dataset")

print("Datasets saved successfully.")

Datasets saved successfully.


In [None]:
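# Create the target folder first; if it does not exist, the first `mv`
# renames train_dataset to `datasets` instead of moving it inside
!mkdir -p /content/drive/MyDrive/datasets/
!mv /content/train_dataset /content/drive/MyDrive/datasets/
!mv /content/test_dataset /content/drive/MyDrive/datasets/
!mv /content/val_dataset /content/drive/MyDrive/datasets/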

In [None]:
!ls /content/drive/MyDrive/datasets/

data-00000-of-00001.arrow  dataset_info.json  state.json  test_dataset	val_dataset


In [None]:
!pip install datasets


In [None]:
# Load the datasets from Google Drive to skip the preprocessing steps.
'''
from datasets import load_from_disk
from google.colab import drive

# Step 1: Mount Google Drive
drive.mount('/content/drive')

# Step 2: Load datasets from Google Drive
train_dataset = load_from_disk("/content/drive/MyDrive/datasets/train_dataset")
test_dataset = load_from_disk("/content/drive/MyDrive/datasets/test_dataset")
val_dataset = load_from_disk("/content/drive/MyDrive/datasets/val_dataset")

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"Test examples: {len(test_dataset)}")
'''

Training examples: 27072
Validation examples: 3008
Test examples: 7520


The train_test_split method splits the dataset into two parts:
- 80% of the data is allocated to the training set (train_dataset).
- 20% of the data is allocated to the test set (test_dataset).

The training set (train_dataset) is then split again:
- 90% of it remains as the new training set (train_dataset).
- 10% of it is allocated to the validation set (val_dataset).

Separate Test Set: The test set (test_dataset) is kept completely separate from the training and validation sets. It ensures that the model's performance can be evaluated on unseen data.

Validation Set for Hyperparameter Tuning: The validation set (val_dataset) is used during training to monitor performance and tune hyperparameters, which helps detect overfitting to the training data.

Proportional Splits: The resulting proportions are 72% train, 8% validation, and 20% test of the full dataset (27,072 / 3,008 / 7,520 of 37,600 rows), which leaves enough data for both training and evaluation.
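
As a quick check of the split proportions, the shares can be recomputed from the example counts printed earlier. This is plain arithmetic on the reported numbers; nothing here calls the `datasets` library.

```python
# Example counts as printed by the split cell
counts = {"train": 27072, "validation": 3008, "test": 7520}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

print(f"Total examples: {total}")
for name, share in shares.items():
    print(f"{name:10s} {share:.0%}")
```

Because the validation set is carved out of the 80% training portion, it amounts to 0.8 × 0.1 = 8% of all rows, and the training set to 0.8 × 0.9 = 72%.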

In [None]:
'''
train_dataset = train_dataset.select(range(3000))  # Use only the first 3000 examples
val_dataset = val_dataset.select(range(1000))       # Use only the first 1000 examples
test_dataset = test_dataset.select(range(1000))     # Use only the first 1000 examples

# Verify the split
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"Test examples: {len(test_dataset)}")
'''

Training examples: 3000
Validation examples: 1000
Test examples: 1000


In [21]:
# Calculate 20% of each dataset size
train_size = int(len(train_dataset) * 0.2)
val_size = int(len(val_dataset) * 0.2)
test_size = int(len(test_dataset) * 0.2)

# Select 20% of each dataset
train_dataset = train_dataset.select(range(train_size))
val_dataset = val_dataset.select(range(val_size))
test_dataset = test_dataset.select(range(test_size))

# Verify the split
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(val_dataset)}")
print(f"Test examples: {len(test_dataset)}")

Training examples: 5414
Validation examples: 601
Test examples: 1504


# Model Training

- Load a pretrained DistilBERT model for token classification. This model will be fine-tuned on your dataset.

In [25]:
from transformers import DistilBertForTokenClassification

id_to_label = {i: label for label, i in label_to_id.items()}  # Reverse mapping
# Compute the number of unique labels
num_labels = len(label_to_id)

# Load the pretrained DistilBERT model
model = DistilBertForTokenClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=num_labels,  # Number of unique labels
    id2label=id_to_label,        # Mapping from ID to label
    label2id=label_to_id         # Mapping from label to ID
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [23]:
print(f"Model loaded with {num_labels} unique labels.")

Model loaded with 5 unique labels.


## Set Up Training Arguments

- Define the training arguments for fine-tuning the model. These include parameters like learning rate, batch size, and number of epochs.

In [26]:
import os
os.environ["HF_DISABLE_TQDM"] = "0"  # Ensure tqdm progress bars are enabled

In [27]:
!pip install tqdm



In [29]:
from transformers import Trainer, TrainingArguments
#from tqdm.auto import tqdm  # Import tqdm for progress bars
from tqdm.notebook import tqdm  # Use the notebook-compatible version of tqdm
from transformers.trainer_callback import TrainerCallback

# Define a custom callback to log both training and validation loss
class LossLoggingCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None:
            # Log training loss
            if 'loss' in logs:
                print(f"Training Loss: {logs['loss']:.4f}")
            # Log validation loss
            if 'eval_loss' in logs:
                print(f"Validation Loss: {logs['eval_loss']:.4f}")

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',              # Directory to save model checkpoints
    evaluation_strategy="steps",        # Evaluate every `eval_steps` steps (faster than per epoch)
    eval_steps=100,                      # Evaluate every 100 steps
    learning_rate=5e-5,                  # Slightly higher learning rate for faster convergence
    per_device_train_batch_size=8,      # Reduced batch size to fit CPU memory
    per_device_eval_batch_size=8,       # Reduced batch size for evaluation
    gradient_accumulation_steps=8,      # Simulate larger batch sizes by accumulating gradients
    num_train_epochs=1,                 # Reduce number of epochs to 1 for faster training
    weight_decay=0.01,                  # Weight decay for regularization
    save_strategy="no",                 # Disable saving models to save time and disk space
    logging_dir='./logs',               # Directory for logs
    logging_steps=100,                   # Log every 100 steps
    disable_tqdm=False,                 # Enable tqdm progress bar
    optim="adamw_torch_fused",          # Fused AdamW implementation (speedups are mainly reported on GPU)
    load_best_model_at_end=False,       # Skip loading the best model at the end
    metric_for_best_model=None,         # No need to track metrics if not saving the best model
    greater_is_better=None,             # Not applicable without `metric_for_best_model`
    fp16=False                          # Mixed precision (`fp16`) is only effective on GPUs
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    #callbacks=[LossLoggingCallback()]   # Add the custom callback
)

# Fine-tune the model with a progress bar
print("Starting training...")
trainer.train()



Starting training...


Step,Training Loss,Validation Loss


TrainOutput(global_step=84, training_loss=0.043372259253547304, metrics={'train_runtime': 16591.2605, 'train_samples_per_second': 0.326, 'train_steps_per_second': 0.005, 'total_flos': 702429166043136.0, 'train_loss': 0.043372259253547304, 'epoch': 0.9926144756277696})
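
The `global_step=84` in the output above can be sanity-checked from the batch settings. The sketch below reproduces the arithmetic. The dataset size of 5,416 examples is an assumption inferred from the logged throughput (about 0.326 samples/s over 16,591 s) and the fractional epoch; it is not stated in the notebook. `optimizer_steps_per_epoch` is a hypothetical helper, not part of the Trainer API.

```python
import math

def optimizer_steps_per_epoch(num_examples, per_device_batch_size, grad_accum_steps):
    """Approximate number of optimizer updates in one epoch on a single device."""
    batches = math.ceil(num_examples / per_device_batch_size)  # dataloader length
    # Floor division: a trailing partial accumulation window does not add a step,
    # which is consistent with the step count reported in this run.
    return max(batches // grad_accum_steps, 1)

per_device_train_batch_size = 8
gradient_accumulation_steps = 8
num_train_examples = 5416  # ASSUMPTION: inferred from the logs, not stated in the notebook

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps = optimizer_steps_per_epoch(
    num_train_examples, per_device_train_batch_size, gradient_accumulation_steps
)

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps}")
print(f"Step 100 is reached: {steps >= 100}")
```

Under these assumptions the run ends after 84 steps, below both `eval_steps=100` and `logging_steps=100`. That would explain why the Step/Training Loss/Validation Loss table has no rows; lowering both values (for example to 20) should produce intermediate losses.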

## Train the Model
- Use the Hugging Face Trainer API to fine-tune the model on your dataset.



# Model Evaluation

Evaluate the model on the test set to measure its performance. Use metrics like F1-score, precision, and recall.

## Cross-Entropy Loss

Cross-entropy loss measures the difference between the predicted probability distribution and the true labels. In token classification tasks:

- The model outputs logits (unnormalized scores) for each token.
- These logits are passed through a softmax function to produce probabilities.
- Cross-entropy loss is computed between the predicted probabilities and the ground-truth labels.
- This is the loss that the Trainer logs during training and evaluation.
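
The steps above can be sketched in plain Python. This is a minimal illustration with made-up logits, not the code path the Trainer uses. `softmax` and `token_cross_entropy` are hypothetical helpers, and the example assumes that masked positions carry the label `-100`, the value the evaluation code in this notebook filters out.

```python
import math

IGNORE_INDEX = -100  # label value that marks positions excluded from the loss

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_cross_entropy(logits_per_token, labels):
    """Mean negative log-probability of the true label over non-ignored tokens."""
    losses = []
    for logits, label in zip(logits_per_token, labels):
        if label == IGNORE_INDEX:
            continue  # masked position, contributes nothing to the loss
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

# Hypothetical logits for 3 tokens over 3 labels (made-up numbers)
logits = [
    [2.0, 0.5, 0.1],  # masked position (label -100)
    [0.2, 3.0, 0.1],  # confident and correct (true label 1)
    [1.0, 1.0, 1.0],  # uniform guess (true label 2)
]
labels = [-100, 1, 2]

loss = token_cross_entropy(logits, labels)
print(f"Mean cross-entropy: {loss:.4f}")
```

A confident correct prediction contributes a loss near zero, while a uniform guess over three labels contributes ln(3). The reported loss is the mean over all unmasked tokens.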

In [30]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [31]:
!pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16161 sha256=77b75505cd47b05611abc5c4aa53670209cf1f84ec2b1ad4fd0264163f7bfe79
  Stored in directory: /root/.cache/pip/wheels/bc/92/f0/243288f899c2eacdfa8c5f9aede4c71a9bad0ee26a01dc5ead
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [36]:
from seqeval.metrics import classification_report

# Predict on the test set
predictions, labels, _ = trainer.predict(test_dataset)  # Unpack all three values
preds = predictions.argmax(axis=2)  # Convert logits to predicted label IDs

# Convert IDs back to labels
true_labels = [[id_to_label[l] for l in label if l != -100] for label in labels]
true_predictions = [
    [id_to_label[p] for p, l in zip(prediction, label) if l != -100]
    for prediction, label in zip(preds, labels)
]

# Print the classification report
print(classification_report(true_labels, true_predictions))

              precision    recall  f1-score   support

        Role       0.44      0.40      0.42      1732
      Skills       0.53      0.50      0.51      1691

   micro avg       0.49      0.45      0.47      3423
   macro avg       0.49      0.45      0.47      3423
weighted avg       0.49      0.45      0.47      3423
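
The scores in this report are computed per entity span, not per token. The following sketch shows one way such entity-level precision, recall, and F1 can be computed from BIO tags. It is a simplified illustration, not seqeval's implementation: stray `I-` tags without a preceding `B-` are ignored here, which may differ from seqeval's default handling. `extract_spans` and `entity_prf` are hypothetical helpers, and the tag names assume a `B-`/`I-` scheme over the `Role` and `Skills` types.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))  # close the open span (end is exclusive)
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_spans = extract_spans(true_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(true_spans & pred_spans)  # exact type and boundary match
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up example: the Role span matches exactly, the Skills span is cut short
y_true = [["B-Role", "I-Role", "O", "B-Skills", "I-Skills"]]
y_pred = [["B-Role", "I-Role", "O", "B-Skills", "O"]]

p, r, f = entity_prf(y_true, y_pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Because a span counts as correct only when its type and both boundaries match, a prediction that gets most tokens right can still score zero for that entity. This is one reason entity-level F1 can be much lower than token-level loss suggests.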



In [38]:
# Evaluate the model on the test dataset
print("Evaluating on the test set...")
test_results = trainer.evaluate(test_dataset)

# Print the evaluation results
print(f"Test Loss (Cross-Entropy): {test_results['eval_loss']:.4f}")

Evaluating on the test set...


Test Loss (Cross-Entropy): 0.0084


## Save the Model
- Once training is complete, save the fine-tuned model and tokenizer for inference.

In [32]:
# Save the model and tokenizer
model.save_pretrained('./ner_model')
tokenizer.save_pretrained('./ner_model')

('./ner_model/tokenizer_config.json',
 './ner_model/special_tokens_map.json',
 './ner_model/vocab.txt',
 './ner_model/added_tokens.json',
 './ner_model/tokenizer.json')

In [33]:
import shutil

# Define the folder path and output zip file name
folder_path = '/content/ner_model'
output_zip_name = '/content/ner_model.zip'

# Zip the folder
shutil.make_archive(output_zip_name.replace('.zip', ''), 'zip', folder_path)

print(f"Folder '{folder_path}' has been zipped to '{output_zip_name}'.")

Folder '/content/ner_model' has been zipped to '/content/ner_model.zip'.


The saved ner_model folder contains: config.json, model.safetensors, special_tokens_map.json, tokenizer.json, tokenizer_config.json, and vocab.txt.

In [34]:
from google.colab import drive
import shutil

# Step 1: Mount Google Drive
drive.mount('/content/drive')

# Step 2: Define the source folder and destination path in Google Drive
source_folder = '/content/ner_model'
destination_folder = '/content/drive/MyDrive/ner_model2'

# Step 3: Copy the folder to Google Drive (dirs_exist_ok lets the cell be re-run without a FileExistsError)
shutil.copytree(source_folder, destination_folder, dirs_exist_ok=True)

print(f"Folder '{source_folder}' has been uploaded to Google Drive at '{destination_folder}'.")

Mounted at /content/drive
Folder '/content/ner_model' has been uploaded to Google Drive at '/content/drive/MyDrive/ner_model2'.


How model.save_pretrained() Works:

This method saves:
- The model's architecture (structure).
- The model's learned weights (parameters).
- Additional metadata, such as the label mappings (id2label and label2id).

The saved files are stored in the specified directory (./ner_model in this case) and include:

- config.json: Contains the model's configuration (e.g., number of layers, hidden size) and the label mappings.
- model.safetensors: Contains the model's weights (learned parameters). Older versions of the library write pytorch_model.bin instead.
- tokenizer_config.json and vocab.txt: Written by tokenizer.save_pretrained(), not by the model; they contain the tokenizer's configuration and vocabulary.
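
The label mappings are stored as plain JSON inside config.json. The stand-alone sketch below imitates only that part of the file with the standard library; it does not call `save_pretrained()`. The five label names are an assumption for illustration (the notebook reports 5 labels but does not list them in this section).

```python
import json
import os
import tempfile

# ASSUMPTION: a 5-label BIO scheme used only for illustration
id2label = {0: "O", 1: "B-Role", 2: "I-Role", 3: "B-Skills", 4: "I-Skills"}
label2id = {label: i for i, label in id2label.items()}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w") as f:
        json.dump({"id2label": id2label, "label2id": label2id}, f, indent=2)
    with open(path) as f:
        raw = json.load(f)

# JSON object keys are always strings, so the integer IDs come back as "0", "1", ...
print(raw["id2label"])

# Convert the keys back to integers before indexing with predicted label IDs
restored = {int(k): v for k, v in raw["id2label"].items()}
print(restored[3])
```

When the model is loaded through `from_pretrained()`, the library is understood to perform this key conversion itself, which is why `model.config.id2label[label_id]` can be indexed with integer predictions. Code that reads config.json directly would need the explicit conversion shown here.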


**Importance of Saving the Tokenizer**

- Consistency: The tokenizer is responsible for converting raw text into tokens that the model understands. If you don't save the tokenizer alongside the model, you won't be able to preprocess new input data consistently.
- Tokenization Alignment: The tokenizer ensures that the tokenization process during inference matches the process used during training. This alignment is critical for accurate predictions.

How tokenizer.save_pretrained() Works:

This method saves:
- The tokenizer's vocabulary.
- Special tokens (e.g., [CLS], [SEP]).
- Tokenization rules (e.g., subword splitting logic).

The saved files are stored in the same directory (./ner_model) and include:

- tokenizer_config.json: Contains the tokenizer's configuration.
- vocab.txt: Contains the vocabulary used by the tokenizer.
- special_tokens_map.json and tokenizer.json: Contain the special-token definitions and the serialized fast tokenizer.
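
To illustrate the subword splitting logic mentioned above, here is a toy version of WordPiece's greedy longest-match-first lookup. It is a sketch of the general technique, not the library's tokenizer: `wordpiece` is a hypothetical helper, and `toy_vocab` is a made-up vocabulary (the real vocab.txt has roughly 30,000 entries).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate and try again
        if piece is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (hypothetical)
toy_vocab = {"design", "des", "##ing", "api", "##s"}

print(wordpiece("design", toy_vocab))
print(wordpiece("desing", toy_vocab))
print(wordpiece("apis", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

A misspelled word such as "desing" is split into different pieces than "design", so the same vocabulary and rules must be used at inference as during training. This is why the tokenizer is saved next to the model.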

## Use the Model for Inference
- Test the fine-tuned model to extract entities from new resumes.

In [39]:
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast

# Load the saved model and tokenizer
model = DistilBertForTokenClassification.from_pretrained('./ner_model')
tokenizer = DistilBertTokenizerFast.from_pretrained('./ner_model')

# Example input text
text = ["Social Media Manager and Marketing Specialist. Poses skills in UI design, statistical analysis, Content creation, design principles, Usability testing, API desing and development, Social media, Data Analysis, Probelm Solving"]
#text = "Stephanie Morris	Back-End Developer	API Developer	806.716.2250x944	BA	5 to 14 Years	API design and development RESTful API knowled...	Capital One Financial"
# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

# Make predictions
outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()  # Get predicted labels

# Decode the tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze().tolist())

# Get the id2label mapping
id2label = model.config.id2label

# Function to extract entities
def extract_entities(tokens, predictions, id2label):
    entities = []
    current_entity = None
    current_tokens = []

    for token, label_id in zip(tokens, predictions):
        label = id2label[label_id]

        # Skip special tokens ([CLS], [SEP], [PAD])
        if token in ["[CLS]", "[SEP]", "[PAD]"]:
            continue

        # Handle subword tokens (e.g., "##doe")
        token_clean = token.replace("##", "")

        if label.startswith("B-"):  # Beginning of a new entity
            if current_entity:
                # Save the previous entity
                entities.append((current_entity, "".join(current_tokens)))
            current_entity = label[2:]  # Extract entity type (e.g., "Name")
            current_tokens = [token_clean]
        elif label.startswith("I-") and current_entity:  # Inside an entity
            current_tokens.append(token_clean)
        else:  # Outside any entity or end of an entity
            if current_entity:
                # Save the previous entity
                entities.append((current_entity, "".join(current_tokens)))
            current_entity = None
            current_tokens = []

    # Add the last entity if it exists
    if current_entity:
        entities.append((current_entity, "".join(current_tokens)))

    return entities

def post_process_entities(entities):
    processed_entities = []
    current_entity_type = None
    current_entity_text = []

    for entity_type, entity_text in entities:
        if entity_type == current_entity_type:
            # Append to the current entity
            current_entity_text.append(entity_text)
        else:
            # Save the previous entity
            if current_entity_type:
                processed_entities.append((current_entity_type, " ".join(current_entity_text)))
            # Start a new entity
            current_entity_type = entity_type
            current_entity_text = [entity_text]

    # Add the last entity
    if current_entity_type:
        processed_entities.append((current_entity_type, " ".join(current_entity_text)))

    return processed_entities

# Extract entities
entities = extract_entities(tokens, predictions, id2label)

# Post-process the extracted entities
processed_entities = post_process_entities(entities)

# Print the processed entities
print("Processed Entities:")
for entity_type, entity_text in processed_entities:
    print(f"{entity_type}: {entity_text}")

# Alternative: print the raw, unmerged entities
# print("Extracted Entities:")
# for entity_type, entity_text in entities:
#     print(f"{entity_type}: {entity_text}")

Processed Entities:
Skills: socialmediamanagerandmarketingspecialist.posesskillsinuidesign,statisticalanalysis,contentcreation,designprinciples,usabilitytesting,apidesinganddevelopment,socialmedia,dataanalysis,probelmsolving
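
The run-together text in the output above comes from joining every token with an empty string. Only pieces that start with `##` should be glued to the previous token; whole words need a space between them. A minimal sketch of that rule follows. `merge_wordpieces` is a hypothetical helper, and the token list is made up, since the real split depends on the tokenizer's vocabulary.

```python
def merge_wordpieces(tokens):
    """Rebuild readable text: '##' pieces attach to the previous token, others get a space."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]    # continuation piece: no space
        elif text:
            text += " " + token  # new word: separate with a space
        else:
            text = token         # first token
    return text

# Hypothetical token sequence for illustration
tokens = ["api", "des", "##ing", "and", "development"]

print(merge_wordpieces(tokens))
print("".join(t.replace("##", "") for t in tokens))  # the joining used in the cell above
```

Using this rule inside `extract_entities` would require keeping the raw tokens (with their `##` prefixes) until the entity is finalized. Punctuation tokens would still be separated by spaces, so a further clean-up step may be wanted.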

# Streamlit

In [None]:
#!pip install streamlit transformers PyPDF2 scikit-learn

In [None]:
import streamlit as st
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
from PyPDF2 import PdfReader
import re
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the saved model and tokenizer
@st.cache_resource
def load_model_and_tokenizer():
    model = DistilBertForTokenClassification.from_pretrained('./ner_model')
    tokenizer = DistilBertTokenizerFast.from_pretrained('./ner_model')
    return model, tokenizer

model, tokenizer = load_model_and_tokenizer()

# Function to extract text from PDF
def extract_text_from_pdf(file):
    reader = PdfReader(file)
    text = ""
    for page in reader.pages:
        text += page.extract_text()
    return text

# Function to clean and preprocess text
def preprocess_text(text):
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = text.strip()
    return text

# Function to extract entities using the NER model
def extract_entities(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # Inference only, so gradients are not needed
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1).squeeze().tolist()

    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze().tolist())
    id2label = model.config.id2label

    entities = []
    current_entity = None
    current_text = ""

    for token, label_id in zip(tokens, predictions):
        label = id2label[label_id]

        # Skip special tokens
        if token in ["[CLS]", "[SEP]", "[PAD]"]:
            continue

        # WordPiece continuation pieces start with "##" and belong to the previous word
        is_subword = token.startswith("##")
        token_clean = token[2:] if is_subword else token

        if label.startswith("B-"):  # Beginning of a new entity
            if current_entity:
                entities.append((current_entity, current_text))
            current_entity = label[2:]
            current_text = token_clean
        elif label.startswith("I-") and current_entity:  # Inside an entity
            # Glue subword pieces to the previous word; separate whole words with a space
            current_text += token_clean if is_subword else " " + token_clean
        else:  # Outside any entity
            if current_entity:
                entities.append((current_entity, current_text))
            current_entity = None
            current_text = ""

    # Add the last entity if it exists
    if current_entity:
        entities.append((current_entity, current_text))

    return entities

# Function to calculate similarity score between resume and job description
def calculate_similarity(resume_text, job_description):
    vectorizer = TfidfVectorizer().fit([resume_text, job_description])
    vectors = vectorizer.transform([resume_text, job_description])
    similarity = cosine_similarity(vectors)[0][1]
    return round(similarity * 100, 2)

# Streamlit app
st.title("Resume Parsing and Ranking App")

# Sidebar options
option = st.sidebar.selectbox(
    "Choose an option",
    ("Single Resume Parsing", "Batch Resume Ranking")
)

if option == "Single Resume Parsing":
    st.header("Single Resume Parsing")
    uploaded_file = st.file_uploader("Upload a resume (PDF)", type=["pdf"])
    job_description = st.text_area("Enter the job description:")

    if uploaded_file and job_description:
        # Extract text from resume
        resume_text = extract_text_from_pdf(uploaded_file)
        resume_text = preprocess_text(resume_text)

        # Extract entities
        entities = extract_entities(resume_text, model, tokenizer)

        # Calculate similarity score
        similarity_score = calculate_similarity(resume_text, job_description)

        # Display extracted text
        st.subheader("Extracted Resume Text")
        st.write(resume_text)

        # Display extracted entities
        st.subheader("Extracted Entities")
        entity_dict = {}
        for entity_type, entity_value in entities:
            if entity_type not in entity_dict:
                entity_dict[entity_type] = []
            entity_dict[entity_type].append(entity_value)

        for entity_type, values in entity_dict.items():
            st.write(f"**{entity_type}**: {', '.join(values)}")

        # Compare word-level overlap between the job description and the resume
        # (a rough proxy for skills: every whitespace-separated word is compared)
        job_skills = set(job_description.lower().split())
        resume_skills = set(resume_text.lower().split())
        matching_skills = job_skills.intersection(resume_skills)
        missing_skills = job_skills.difference(resume_skills)

        st.subheader("Matching Skills")
        st.write(", ".join(sorted(matching_skills)))  # Sorted for a stable display order

        st.subheader("Missing Skills")
        st.write(", ".join(sorted(missing_skills)))

        # Display similarity score
        st.subheader("Resume Score")
        st.write(f"{similarity_score}% match with the job description")

elif option == "Batch Resume Ranking":
    st.header("Batch Resume Ranking")
    uploaded_files = st.file_uploader("Upload resumes (PDF)", type=["pdf"], accept_multiple_files=True)
    job_description = st.text_area("Enter the job description:")

    if uploaded_files and job_description:
        scores = []

        for uploaded_file in uploaded_files:
            # Extract text from resume
            resume_text = extract_text_from_pdf(uploaded_file)
            resume_text = preprocess_text(resume_text)

            # Calculate similarity score
            similarity_score = calculate_similarity(resume_text, job_description)
            scores.append((uploaded_file.name, similarity_score))

        # Sort resumes by score
        scores.sort(key=lambda x: x[1], reverse=True)

        # Display ranked resumes
        st.subheader("Ranked Resumes")
        for i, (filename, score) in enumerate(scores):
            st.write(f"{i + 1}. **{filename}**: {score}% match with the job description")

## Explanation of Features

**1. Single Resume Parsing**
- File Upload: Users can upload a single PDF resume.
- Job Description Input: Users provide a job description in a text box.
- Output:
  - Extracted Resume Text: Displays the raw text extracted from the resume.
  - Extracted Entities: Lists all named entities (e.g., Name, Title, Skills) captured by the NER model.
  - Matching Skills: Lists words from the job description that also appear in the resume (a word-level comparison).
  - Missing Skills: Lists words from the job description that do not appear in the resume.
  - Resume Score: Provides a similarity score (percentage) between the resume and job description.

**2. Batch Resume Ranking**
- File Upload: Users can upload multiple PDF resumes.
- Job Description Input: Users provide a job description in a text box.
- Output:
  - Ranked Resumes: Displays a ranked list of resumes based on their similarity scores with the job description.
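
The similarity score in both modes comes from TF-IDF vectors compared by cosine similarity. The pure-Python sketch below spells out that computation under the assumption that scikit-learn's defaults apply (raw term counts, smoothed IDF, L2 normalization). It is meant to illustrate the formula and is not guaranteed to match the library digit for digit. `tokenize`, `tfidf_vectors`, and `similarity_percent` are hypothetical names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased words of 2+ characters, similar to TfidfVectorizer's default pattern
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency per term
    vectors = []
    for c in counts:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def similarity_percent(resume_text, job_description):
    a, b = tfidf_vectors([resume_text, job_description])
    cosine = sum(w * b.get(t, 0.0) for t, w in a.items())  # dot product of unit vectors
    return round(cosine * 100, 2)

print(similarity_percent("python sql dashboards", "python sql dashboards"))
print(similarity_percent("python sql", "graphic design"))
print(similarity_percent("python sql dashboards", "python data analysis"))
```

Identical texts score 100 and texts with no shared words score 0. Because the vectorizer is fitted on only two documents at a time, the IDF term carries little information, so the score behaves much like a normalized word-overlap measure.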