# AI Resume Parser (NLTK)

In [9]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import PyPDF2
import re

import nltk: This imports the NLTK library, which stands for Natural Language Toolkit. NLTK is a widely used library in Python for natural language processing tasks.

from nltk.tokenize import word_tokenize: This line imports the word_tokenize function from the nltk.tokenize module. word_tokenize is used to split a text into individual words or tokens.

from nltk.corpus import stopwords: This line imports the stopwords module from the nltk.corpus package. Stopwords are common words (e.g., "a", "an", "the") that are often removed from text as they do not carry significant meaning.

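from nltk.stem import WordNetLemmatizer: This line imports the WordNetLemmatizer class from the nltk.stem module. Lemmatization is the process of reducing words to their base or dictionary form (e.g., "skills" to "skill"). The lemmatize method treats every word as a noun unless a part-of-speech tag is passed, so a verb form such as "running" is reduced to "run" only when it is called with pos="v".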

import PyPDF2: This imports the PyPDF2 library, which provides functionality for working with PDF files in Python.

import re: This imports the regular expression (regex) module in Python. Regular expressions are used for pattern matching and text manipulation.

In [10]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True



nltk.download('punkt')
This line downloads the "punkt" resource, which contains pre-trained Punkt models for sentence tokenization in several languages. word_tokenize depends on it: the function first splits the text into sentences with Punkt and then splits each sentence into individual word tokens.


nltk.download('stopwords')
This line downloads the "stopwords" resource, which consists of commonly used words that are often removed from text during text processing tasks. These words, such as "the," "is," "and," etc., typically do not carry significant meaning and can be disregarded in certain NLP (Natural Language Processing) applications.


nltk.download('wordnet')
This line downloads the "wordnet" resource, which is a large lexical database of English words. WordNet provides synsets (sets of synonyms) and definitions for words, along with relationships between words like hypernyms (superordinate terms) and hyponyms (subordinate terms). It is widely used in various NLP tasks such as word sense disambiguation, semantic similarity, and more.

nltk.download('omw-1.4')
This line downloads the "omw-1.4" resource, which stands for Open Multilingual WordNet. It is a multilingual extension of WordNet and provides synsets and word relationships for several languages other than English.

By executing these nltk.download() statements, you ensure that the required NLTK resources are available for your code to work properly. This step is typically done once, at the beginning of a project, to download the necessary resources before utilizing NLTK functionalities.

In [11]:
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Convert to lowercase
    tokens = [token.lower() for token in tokens]
    
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Return the preprocessed text
    return tokens

Tokenization: The text is tokenized using the word_tokenize function from the NLTK library. Tokenization is the process of splitting the text into individual words or tokens.

Lowercasing: Each token is converted to lowercase using a list comprehension. This step helps in standardizing the text and treating words in a case-insensitive manner.

Stopword removal: Stopwords are common words that do not carry much meaning and are often removed from the text to reduce noise. The code retrieves a set of stopwords for the English language using the stopwords.words('english') function from NLTK. It then filters out the stopwords from the tokenized text using another list comprehension.

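Lemmatization: Lemmatization is the process of reducing words to their base or dictionary form. It helps in normalizing words that have the same meaning but different forms (e.g., "skills" becomes "skill"). The code initializes a WordNetLemmatizer object from the NLTK library and applies lemmatization to each token using a list comprehension. Because lemmatize is called without a part-of-speech argument, every token is treated as a noun, so verb forms such as "running" and "ran" are left unchanged here; they become "run" only when pos="v" is passed.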

Return: The preprocessed tokens are returned as the output of the function.
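The first three steps above can be illustrated without NLTK. The snippet below is only a sketch: demo_preprocess, DEMO_STOP_WORDS, and the regex tokenizer are hypothetical stand-ins, not the notebook's word_tokenize or NLTK's English stopword list, and lemmatization is left out.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "and", "with", "of", "in"}

def demo_preprocess(text):
    """Mimic tokenize -> lowercase -> stopword removal with the standard library."""
    # Crude tokenizer: runs of letters, digits, or apostrophes.
    # This is NOT equivalent to NLTK's word_tokenize.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    # Lowercase so that matching is case-insensitive.
    tokens = [token.lower() for token in tokens]
    # Drop tokens that appear in the stand-in stopword set.
    return [token for token in tokens if token not in DEMO_STOP_WORDS]

print(demo_preprocess("Fruitseller with the communication skills"))
# ['fruitseller', 'communication', 'skills']
```

In the notebook's own function, the lemmatizer would additionally reduce "skills" to "skill".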

In [12]:
def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        resume_text = ""
        
        for page_number in range(num_pages):
            page = pdf_reader.pages[page_number]
            resume_text += page.extract_text()
        
        return resume_text

The function opens the PDF file in binary mode using the open function with the mode 'rb' (read binary). The file is opened within a context manager, denoted by the with statement. This ensures that the file is properly closed after it's processed, even if an exception occurs.

The PdfReader class from the PyPDF2 library is used to read the PDF file. It takes the file object as an argument and creates a PdfReader object named pdf_reader.

The variable num_pages is assigned the number of pages in the PDF file, which is obtained by calling the len function on pdf_reader.pages. This gives the total count of pages in the PDF document.

The variable resume_text is initialized as an empty string. It will be used to store the extracted text from the PDF.

A loop is set up to iterate over each page in the PDF. It uses the range function to generate a sequence of page numbers from 0 to num_pages - 1. For each page, the code retrieves the corresponding Page object from pdf_reader.pages using the current page_number.

The extract_text method is called on the Page object to extract the text content from that page. The extracted text is then concatenated with the existing resume_text string.

After all the pages have been processed, the function returns the accumulated resume_text, which contains the extracted text from the entire PDF document.
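The page loop can also be written by iterating over the pages directly. The sketch below uses a hypothetical helper, join_page_text, and a hypothetical FakePage class standing in for PyPDF2 page objects so that it runs without a PDF file. It assumes only that each page exposes an extract_text() method, as in the function above.

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects, one page per line."""
    # `or ""` treats a page that yields no text (None or "") as empty.
    return "\n".join(page.extract_text() or "" for page in pages)

class FakePage:
    """Hypothetical stand-in for a PDF page object, used only for this demo."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

pages = [FakePage("Skills: Python"), FakePage(None), FakePage("5 years")]
print(repr(join_page_text(pages)))
```

Joining with a newline is a design choice that differs from the original function: it keeps the last word of one page from fusing with the first word of the next, which would otherwise produce a single bogus token.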

In [13]:

def calculate_similarity(text1, text2):
    # Create frequency distributions for the tokens
    freq_dist1 = nltk.FreqDist(text1)
    print(freq_dist1)
    freq_dist2 = nltk.FreqDist(text2)
    print(freq_dist2)
    
    # Calculate the Jaccard similarity coefficient
    common_tokens = set(text1).intersection(set(text2))
    similarity = len(common_tokens) / (len(text1) + len(text2) - len(common_tokens))
    
    return similarity


The function begins by creating frequency distributions for the tokens in text1 and text2 using the FreqDist class from the NLTK library. The FreqDist object counts the occurrences of each token in the text. The resulting frequency distributions are assigned to the variables freq_dist1 and freq_dist2 respectively. They are not used in the similarity calculation that follows; they serve only as diagnostic output.

The code then prints the frequency distributions using print(freq_dist1) and print(freq_dist2). Printing a FreqDist shows a one-line summary with the number of distinct tokens (samples) and the total number of tokens (outcomes), for example "<FreqDist with 3 samples and 3 outcomes>". This can be helpful for understanding the size of each text, but it is not necessary for the similarity calculation.

The Jaccard similarity coefficient is calculated next. The Jaccard similarity is a measure of the similarity between two sets, in this case, the sets of tokens in text1 and text2. The code first finds the set of common tokens by taking the intersection of the sets created from text1 and text2. The set() function is used to convert the lists of tokens into sets. The result is stored in the common_tokens variable.

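The similarity is then calculated using the formula: similarity = len(common_tokens) / (len(text1) + len(text2) - len(common_tokens)). The numerator is the number of distinct tokens that are common to both texts. Note that the denominator uses len(text1) and len(text2), which are the lengths of the token lists including repeated tokens, not the sizes of the corresponding sets. The result therefore equals the Jaccard coefficient |A ∩ B| / |A ∪ B| only when neither list contains duplicates. When tokens repeat, the denominator is larger than the size of the union and the score is lower than the true Jaccard value. In the run recorded in this notebook, for instance, the resume contributes 276 tokens to the denominator although only 157 of them are distinct.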

Finally, the similarity value is returned as the output of the function.
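To make the difference concrete, here is a minimal sketch comparing a set-based Jaccard coefficient with the list-length formula used in the function above. The names jaccard_similarity and list_length_score, and the sample tokens, are illustrative and not part of the original notebook.

```python
def jaccard_similarity(tokens1, tokens2):
    """Jaccard coefficient |A & B| / |A | B| over the distinct tokens."""
    set1, set2 = set(tokens1), set(tokens2)
    union = set1 | set2
    if not union:
        # Both inputs empty: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(set1 & set2) / len(union)

def list_length_score(tokens1, tokens2):
    """The notebook's formula: the denominator counts repeated tokens."""
    common = set(tokens1) & set(tokens2)
    return len(common) / (len(tokens1) + len(tokens2) - len(common))

job = ["python", "sql"]
resume = ["python", "python", "python", "excel"]

print(jaccard_similarity(job, resume))  # 1 shared token / 3 distinct tokens
print(list_length_score(job, resume))   # 1 / (2 + 4 - 1)
```

The two scores agree whenever both token lists are free of duplicates; with repeated tokens the list-length score is always the smaller of the two.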

In [14]:

job_description = input("Enter the job description")
resume_file_path =input("Enter the Resume")


Enter the job descriptionfruitseller with communication skills
Enter the ResumeProfessional Resume.pdf


job_description = input("Enter the job description"): This line displays the message "Enter the job description" to the user and waits for input. The user can enter a description of a job, such as the requirements, responsibilities, or qualifications. The inputted value is then stored in the job_description variable. Because the prompt string has no trailing colon or space, the typed text appears directly after the prompt in the output above.

resume_file_path = input("Enter the Resume"): This line displays the message "Enter the Resume" to the user and waits for input. The user is expected to provide the path to a resume file containing their professional information, skills, and experience. The file must be a PDF, since the path is later passed to extract_text_from_pdf; other formats such as Word documents are not handled by this code. The inputted value, representing the file path, is stored in the resume_file_path variable.
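Since the path comes straight from user input, it may be worth checking it before trying to open the file. The helper below, check_resume_path, is a hypothetical addition and not part of the original notebook; it is a sketch that only checks the file extension and that the file exists.

```python
from pathlib import Path

def check_resume_path(raw_path):
    """Return a Path for an existing .pdf file, or raise a descriptive error."""
    # Strip stray whitespace that often sneaks in with typed or pasted input.
    path = Path(raw_path.strip())
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    return path

try:
    check_resume_path("resume.docx")
except ValueError as error:
    print(error)
```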

In [15]:
preprocessed_job = preprocess_text(job_description)
print("preprocessed_job ..fetched...")
print("job_description fetched...")
resume_text = extract_text_from_pdf(resume_file_path)
preprocessed_resume = preprocess_text(resume_text)
print("precessing-resume... done...")
skills_pattern = r"(\b[\w\s&]+)\b\s?(?:\b[\w\s&]+?\b\s?){0,3}\b(?:skills|proficient in|expert in|knowledge in)\b"
experience_pattern = r"(\d+)\s?(?:year[s]?)?\s?(?:of)?\s?(?:experience)?"


preprocessed_job ..fetched...
job_description fetched...
precessing-resume... done...


preprocessed_job = preprocess_text(job_description): The preprocess_text function is called, passing the job_description variable as the input. This function preprocesses the text by tokenizing it, converting it to lowercase, removing stopwords, and lemmatizing the tokens. The preprocessed result is stored in the preprocessed_job variable.

print("preprocessed_job ..fetched..."): This line simply prints the message "preprocessed_job ..fetched..." to the console, indicating that the preprocessed job description has been obtained.

print("job_description fetched..."): This line prints the message "job_description fetched..." to the console, indicating that the original job description input has been obtained.

resume_text = extract_text_from_pdf(resume_file_path): The extract_text_from_pdf function is called, passing the resume_file_path variable as the input. This function reads the content of the PDF file located at the specified path and extracts the text from it. The extracted text is stored in the resume_text variable.

preprocessed_resume = preprocess_text(resume_text): The preprocess_text function is called, passing the resume_text variable as the input. This function preprocesses the resume text in a similar way to the job description, by tokenizing, lowercasing, removing stopwords, and lemmatizing the tokens. The preprocessed result is stored in the preprocessed_resume variable.

print("precessing-resume... done..."): This line prints the message "precessing-resume... done..." to the console, indicating that the preprocessing of the resume text is complete.

skills_pattern = r"(\b[\w\s&]+)\b\s?(?:\b[\w\s&]+?\b\s?){0,3}\b(?:skills|proficient in|expert in|knowledge in)\b": This line defines a regular expression pattern stored in the skills_pattern variable. The pattern looks for one of the keywords "skills," "proficient in," "expert in," or "knowledge in" and captures, in its single group, a run of word characters, whitespace, and ampersands that appears before the keyword.

experience_pattern = r"(\d+)\s?(?:year[s]?)?\s?(?:of)?\s?(?:experience)?": This line defines another regular expression pattern stored in the experience_pattern variable. The pattern is designed to match experience-related phrases in a text, such as "3 years of experience." Every part after the digits is optional, however, so the pattern also matches any bare number in the text.
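The effect of the optional groups in experience_pattern can be checked on a small string. In the sketch below, strict_experience_pattern is a hypothetical alternative, not part of the original notebook, which requires the word "year" or "years" after the number. The sample text is made up for illustration.

```python
import re

# Pattern from the notebook: everything after the digits is optional.
experience_pattern = r"(\d+)\s?(?:year[s]?)?\s?(?:of)?\s?(?:experience)?"

# Hypothetical stricter variant: the number must be followed by "year" or "years".
strict_experience_pattern = r"(\d+)\+?\s*years?(?:\s+of)?(?:\s+experience)?"

sample = "Call 555 1234, 3 years of experience"

loose = re.findall(experience_pattern, sample, re.IGNORECASE)
strict = re.findall(strict_experience_pattern, sample, re.IGNORECASE)

print(loose)   # ['555', '1234', '3']
print(strict)  # ['3']
```

Because each pattern has exactly one capturing group, re.findall returns only the captured digits for each match rather than the full matched phrase.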

In [17]:
skills = re.findall(skills_pattern, resume_text, re.IGNORECASE)
experience = re.findall(experience_pattern, resume_text, re.IGNORECASE)
print("Experiance.. fetched...")
print("Skills parsed..")
similarity = calculate_similarity(preprocessed_job, preprocessed_resume)
print("Percentage of Compentency for the role",similarity)
if similarity > 0.01:
    chosen_candidate = "Candidate is suitable"
else:
    chosen_candidate = "Candidate is not suitable"

print("The chosen candidate is:", chosen_candidate)

Experiance.. fetched...
Skills parsed..
<FreqDist with 3 samples and 3 outcomes>
<FreqDist with 157 samples and 276 outcomes>
Percentage of Compentency for the role 0.007220216606498195
The chosen candidate is: Candidate is not suitable


skills = re.findall(skills_pattern, resume_text, re.IGNORECASE): The re.findall() function is called with the skills_pattern, resume_text, and the re.IGNORECASE flag. This function searches for all non-overlapping matches of the skills_pattern regular expression in the resume_text. Because the pattern contains one capturing group, findall returns the text captured by that group for each match (the words preceding the keyword) rather than the whole match. The resulting list is stored in the skills variable.

experience = re.findall(experience_pattern, resume_text, re.IGNORECASE): Similar to the previous step, the re.findall() function is called with the experience_pattern, resume_text, and the re.IGNORECASE flag. It returns the digits captured for each match as a list of strings. Since everything after the digits is optional in the pattern, the list contains every number found in the resume, not only those followed by "years of experience." The resulting list is stored in the experience variable. Neither skills nor experience is used in the similarity calculation or in the final decision below.

print("Experiance.. fetched..."): This line prints the message "Experiance.. fetched..." to the console, indicating that the experiences from the resume text have been obtained.

print("Skills parsed.."): This line prints the message "Skills parsed.." to the console, indicating that the skills from the resume text have been parsed.

similarity = calculate_similarity(preprocessed_job, preprocessed_resume): The calculate_similarity function is called with the preprocessed job description (preprocessed_job) and preprocessed resume text (preprocessed_resume) as inputs. This function calculates the similarity between the two texts using the Jaccard similarity coefficient, and the resulting similarity value is stored in the similarity variable.

print("Percentage of Compentency for the role", similarity): This line prints the message "Percentage of Compentency for the role" followed by the calculated similarity value to the console, providing information about the similarity between the job description and the resume. Despite the label, the value is a fraction between 0 and 1 rather than a percentage; the 0.0072 printed above corresponds to about 0.72%.

if similarity > 0.01:: This line starts an if statement, comparing the similarity value with 0.01. If the similarity is greater than this threshold, the following block of code will be executed.

chosen_candidate = "Candidate is suitable": This line assigns the string "Candidate is suitable" to the chosen_candidate variable.

else:: This line starts an else block, which will be executed if the condition in the if statement is not met.

chosen_candidate = "Candidate is not suitable": This line assigns the string "Candidate is not suitable" to the chosen_candidate variable.

print("The chosen candidate is:", chosen_candidate): This line prints the message "The chosen candidate is:" followed by the value of the chosen_candidate variable, indicating whether the candidate is considered suitable or not based on the similarity comparison.
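The decision step can be wrapped in a small function so that the threshold is explicit and easy to change. This is a sketch only: assess_candidate and its threshold parameter are hypothetical names, the default of 0.01 is taken from the code cell above, and the score is the value recorded in the output above.

```python
def assess_candidate(similarity, threshold=0.01):
    """Return the suitability message for a similarity score."""
    # Strictly greater than, matching the comparison in the notebook cell.
    if similarity > threshold:
        return "Candidate is suitable"
    return "Candidate is not suitable"

score = 0.007220216606498195  # similarity value recorded in the run above

# The .2% format multiplies by 100, so the fraction is shown as a true percentage.
print(f"Similarity score: {score:.2%}")
print("The chosen candidate is:", assess_candidate(score))
```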