## **Automated Resume Data Extraction (Name, Email and Phone Numbers) using NLP**

### Problem Statement

In today’s competitive job market, organizations receive thousands of resumes for various job openings. These resumes often come in unstructured formats (text, Word, PDF), making it challenging to extract and organize candidate details efficiently. Manually scanning and processing this information is time-consuming, error-prone, and not scalable.  

The objective of this project is to develop an automated **Named Entity Recognition (NER) system** capable of identifying and extracting key candidate information such as **Name, Email Address, and Phone Number** from resumes in different formats. By leveraging Natural Language Processing (NLP) techniques, this system will streamline the resume screening process, improve data organization, and serve as a foundation for building intelligent recruitment solutions.  


### Step 1: Install Required Libraries and SpaCy Model


In [35]:
# Install required packages (run once)
!pip install pandas spacy python-docx PyPDF2 openpyxl
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Step 2: Importing Required Libraries

In [36]:
import os
import re
import pandas as pd
import spacy
from docx import Document
from PyPDF2 import PdfReader

### Step 3: Load NLP model

In [37]:
# Load the pre-trained small English NLP model from spaCy 
# (used for tokenization, POS tagging, and Named Entity Recognition)
nlp = spacy.load("en_core_web_sm")

### Step 4: Directory containing resumes

In [38]:
# Path to the folder where all resume files (TXT/PDF/DOCX) are stored
resume_dir = "Resumes formats"

### Step 5: Regular Expression Patterns for Extracting Contact Details

In [39]:
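# Regex patterns

# Candidate name: one to three capitalized words. The negative lookahead only
# guards the position where a match starts, so "Curriculum" and "Vitae" are
# listed separately; otherwise a match could begin at "Vitae".
name_pattern = re.compile(
    r"\b(?!Curriculum|Vitae|Resume|RESUME|C\.V\.|Name|Full Name)[A-Z][a-zA-Z]+(?:\s[A-Z][a-zA-Z]+){0,2}\b"
)

# Initial phone pattern; Step 7 redefines phone_pattern with a more permissive version
phone_pattern = re.compile(r'(\+?\d{1,3}[-.\s]?)?(\d{10}|\d{3}[-.\s]\d{3}[-.\s]\d{4})')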
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
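
The following is a small sketch of how the email and phone patterns behave, using a made-up sample string (the address and number are illustrative, not taken from the resume set). One detail worth noting: when a pattern contains capturing groups, as the phone pattern in this step does, `re.findall` returns tuples of groups rather than the full match, so `finditer` with `group(0)` is the safer way to read whole numbers.

```python
import re

# Local copies of the Step 5 email and phone patterns, for illustration only
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
phone_pattern = re.compile(r'(\+?\d{1,3}[-.\s]?)?(\d{10}|\d{3}[-.\s]\d{3}[-.\s]\d{4})')

# Hypothetical sample text
sample = "Contact: jane.doe@example.com | alt: j.doe+jobs@mail.example.org | +91-9988776655"

emails = email_pattern.findall(sample)

# findall() returns one tuple per match because the pattern has capturing groups
phone_groups = phone_pattern.findall(sample)

# finditer() + group(0) gives the full matched text instead
phone_full = [m.group(0) for m in phone_pattern.finditer(sample)]

print(emails)        # ['jane.doe@example.com', 'j.doe+jobs@mail.example.org']
print(phone_groups)  # [('+91-', '9988776655')]
print(phone_full)    # ['+91-9988776655']
```

This is one reason a phone pattern built from non-capturing groups `(?:...)` is easier to use with `findall`.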


### Step 6: Function to Clean Extracted Names

In [40]:
# This function removes extra spaces, newlines, and common labels from a name string

def clean_name(name):
    # Remove newlines, tabs, and extra spaces
    name = re.sub(r'\s+', ' ', name).strip()
    # Remove common labels
    name = re.sub(r'\b(Name|Full Name|Email|E-mail|Phone|Contact)\b', '', name, flags=re.IGNORECASE).strip()
    return name
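
To see why the cleaning step matters, here is a minimal sketch that pairs `clean_name` with a trimmed-down copy of the name pattern (fewer excluded labels than the real one). The header text is hypothetical. Because `\s` in the pattern also matches a newline, the raw match can run past the name into the next line's label, and `clean_name` then removes that label.

```python
import re

# Trimmed-down local copy of the name pattern, for illustration only
name_pattern = re.compile(
    r"\b(?!Resume|RESUME|Name|Full Name)[A-Z][a-zA-Z]+(?:\s[A-Z][a-zA-Z]+){0,2}\b"
)

def clean_name(name):
    # Same logic as the function above: collapse whitespace, then drop label words
    name = re.sub(r'\s+', ' ', name).strip()
    name = re.sub(r'\b(Name|Full Name|Email|E-mail|Phone|Contact)\b', '', name,
                  flags=re.IGNORECASE).strip()
    return name

# Hypothetical resume header: a title line, the name, then a labelled phone line
sample = "RESUME\nJohn Smith\nPhone: 555-0100"

raw = name_pattern.search(sample).group()
cleaned = clean_name(raw)

print(repr(raw))      # 'John Smith\nPhone' (the match crosses the line break)
print(repr(cleaned))  # 'John Smith'
```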


### Step 7: Functions to Extract Text and Contact Details from Resumes

In [41]:
# 1. extract_text_from_file: Reads text content from .txt, .pdf, and .docx files
# 2. extract_details: Extracts name, phone number, and email using regex and spaCy NER

# Function to read text from files
def extract_text_from_file(filepath):
    ext = os.path.splitext(filepath)[1].lower()
    text = ""
    if ext == ".txt":
        with open(filepath, "r", encoding="utf-8", errors="ignore") as f:
            text = f.read()
    elif ext == ".pdf":
        reader = PdfReader(filepath)
        for page in reader.pages:
            # Extract once per page and skip pages that yield no text
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    elif ext == ".docx":
        doc = Document(filepath)
        text = "\n".join([para.text for para in doc.paragraphs])
    return text

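# More robust regex for phone numbers (overrides the phone_pattern from Step 5);
# non-capturing groups make findall() return the full matched string
phone_pattern = re.compile(
    r'(?:\+?\d{1,3}[-.\s]?)?(?:\(?\d{2,4}\)?[-.\s]?){1,3}\d{3,4}[-.\s]?\d{3,4}'
)

# Function to extract details
def extract_details(text):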
    # Extract phone numbers
    phone_matches = phone_pattern.findall(text)
    phones = []
    for p in phone_matches:
        # Keep only numbers with at least 10 digits total
        digits = re.sub(r'\D', '', p)
        if len(digits) >= 10:
            phones.append(p.strip())
    phone = ", ".join(sorted(set(phones)))

    # Extract emails
    email_matches = email_pattern.findall(text)
    email = ", ".join(sorted(set(email_matches))) if email_matches else ""

    # Extract name: try the regex first, then fall back to spaCy PERSON entities
    name_match = name_pattern.search(text)
    if name_match:
        name = clean_name(name_match.group())
    else:
        doc = nlp(text)
        name = ""
        for ent in doc.ents:
            if ent.label_ == "PERSON" and len(ent.text.split()) <= 3:
                name = ent.text.strip()
                break

    return name, phone, email
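
The phone logic can be tried in isolation. The sketch below copies the Step 7 phone pattern into a standalone helper (`extract_phones` is a name introduced here for illustration) and runs it on a made-up line containing two phone numbers and a 9-digit reference code. The pattern matches all three, and the 10-digit filter discards the reference code.

```python
import re

# Local copy of the Step 7 phone pattern
phone_pattern = re.compile(
    r'(?:\+?\d{1,3}[-.\s]?)?(?:\(?\d{2,4}\)?[-.\s]?){1,3}\d{3,4}[-.\s]?\d{3,4}'
)

def extract_phones(text):
    """Return unique phone-like matches that contain at least 10 digits."""
    phones = []
    for match in phone_pattern.findall(text):
        digits = re.sub(r'\D', '', match)  # strip everything except digits
        if len(digits) >= 10:
            phones.append(match.strip())
    return sorted(set(phones))

# Hypothetical text: two phone numbers and a 9-digit reference code
sample = "Mobile: +91-9988776655 | Ref 123-456-789\nOffice: (312) 555-2299"

phones = extract_phones(sample)
print(phones)  # ['(312) 555-2299', '+91-9988776655']
```

The digit-count check is a simple heuristic: it drops short numeric codes but would still accept any other sequence of 10 or more digits, such as a long ID number.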


### Step 8: Process Resumes and Extract Information

In [42]:
# Process all resumes
# Iterates through all resume files in the directory, extracts text, and retrieves name, phone, and email

data = []
for file in sorted(os.listdir(resume_dir)):  # sorted: os.listdir returns files in arbitrary order
    if file.lower().endswith(('.txt', '.pdf', '.docx')):
        path = os.path.join(resume_dir, file)
        text = extract_text_from_file(path)
        name, phone, email = extract_details(text)
        data.append({"File": file, "Name": name, "Phone": phone, "Email": email})
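
As written, the loop stops at the first file that raises an exception (for example a corrupt PDF). One possible variation, sketched below under assumptions, records the error per file and keeps going. `process_directory`, `demo_extract`, and the `Error` column are hypothetical names introduced for this example; `demo_extract` stands in for the combination of `extract_text_from_file` and `extract_details`.

```python
import os
import tempfile

def process_directory(directory, extract_fn, extensions=('.txt', '.pdf', '.docx')):
    """Apply extract_fn to each supported file; record failures instead of raising."""
    rows = []
    for filename in sorted(os.listdir(directory)):  # sorted for a stable row order
        if not filename.lower().endswith(extensions):
            continue
        path = os.path.join(directory, filename)
        try:
            name, phone, email = extract_fn(path)
            rows.append({"File": filename, "Name": name, "Phone": phone,
                         "Email": email, "Error": ""})
        except Exception as exc:  # e.g. a corrupt or unreadable file
            rows.append({"File": filename, "Name": "", "Phone": "",
                         "Email": "", "Error": str(exc)})
    return rows

def demo_extract(path):
    # Hypothetical stand-in for extract_text_from_file + extract_details
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    if not text.strip():
        raise ValueError("no text extracted")
    return text.strip(), "", ""

# Build a throwaway folder: one good file, one empty file, one unsupported file
with tempfile.TemporaryDirectory() as tmp:
    for fname, content in [("a.txt", "Jane Doe"), ("b.txt", ""), ("notes.md", "skip me")]:
        with open(os.path.join(tmp, fname), "w", encoding="utf-8") as f:
            f.write(content)
    rows = process_directory(tmp, demo_extract)

for row in rows:
    print(row)
```

The empty file ends up as a row with an error message rather than aborting the run, and the unsupported `.md` file is skipped.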


### Step 9: Save Extracted Resume Details to Excel

In [43]:
# Save to Excel
df = pd.DataFrame(data)
df.to_excel("Output_Resume_details.xlsx", index=False)

print("Extraction completed. Data saved to 'Output_Resume_details.xlsx'.")

Extraction completed. Data saved to 'Output_Resume_details.xlsx'.


### Step 10: Display Extracted Resume Data

In [44]:
# Shows the DataFrame containing names, phone numbers, and emails extracted from resumes
print(df)

            File             Name            Phone                      Email
0   resume_1.txt      Aarav Mehta   +91-9988776655    aarav.mehta23@gmail.com
1  resume_10.pdf       Hannah Lee   (917) 555-4421    hannah.lee23@icloud.com
2   resume_2.txt    Sophia Turner  +1-646-555-0198  sophia.turner@outlook.com
3   resume_3.txt      Rohan Gupta   +91-9123456789    rohan.gupta07@yahoo.com
4   resume_4.txt     Emily Carter   (312) 555-2299   emily.carter12@gmail.com
5  resume_5.docx       Arjun Nair   +91-9871122334  arjun.nair@protonmail.com
6  resume_6.docx  Olivia Martinez  +1-213-555-7812   olivia.martinez@mail.com
7  resume_7.docx      Karan Singh   +91-9812233445  karan.singh.dev@gmail.com
8   resume_8.pdf    Grace Johnson  +1-408-555-1244  grace.johnson88@yahoo.com
9   resume_9.pdf     Nikhil Verma   +91-9977886655   nikhil.verma@hotmail.com
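
Ten rows are easy to check by eye, but a larger batch would benefit from a programmatic check. The sketch below flags records with an empty field; `find_incomplete` is a hypothetical helper, and the sample records are invented to mirror the shape of the `data` list from Step 8.

```python
def find_incomplete(records, required=("Name", "Phone", "Email")):
    """Return {file: [missing fields]} for records with any empty required field."""
    issues = {}
    for rec in records:
        missing = [field for field in required
                   if not str(rec.get(field, "")).strip()]
        if missing:
            issues[rec["File"]] = missing
    return issues

# Hypothetical records in the same shape as the `data` list built in Step 8
records = [
    {"File": "resume_a.txt", "Name": "Jane Doe",
     "Phone": "+1-202-555-0143", "Email": "jane@example.com"},
    {"File": "resume_b.pdf", "Name": "", "Phone": "(312) 555-0199", "Email": ""},
]

issues = find_incomplete(records)
print(issues)  # {'resume_b.pdf': ['Name', 'Email']}
```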



### Project Insights


1. **Automated Information Extraction**  
   - The pipeline, which combines regex patterns with spaCy NER as a fallback for names, extracted **Name, Email, and Phone Number** for all 10 sample resumes across formats (TXT, DOCX, PDF).  
   - This shows that pattern matching and lightweight NLP can convert unstructured resume text into structured records.  

2. **Format Independence**  
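   - The pipeline returned all three fields for every sample file, regardless of format.  
   - This suggests the approach is **adaptable** to varied file types, although 10 samples are too few to establish robustness on real-world resumes, and the code reads only `.txt`, `.pdf`, and `.docx` files.  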

3. **Improved Efficiency**  
   - Manual resume screening is time-intensive.  
   - Automation reduces effort, minimizes errors, and **accelerates candidate filtering**.  

4. **Scalability**  
   - The system can scale to process **large volumes of resumes** with minimal additional effort.  
   - It also provides a foundation for extracting more complex details such as **skills, education, and job experience**.  

5. **Practical Applicability**  
   - The project illustrates a **real-world application of NLP in recruitment systems**.  
   - Such a solution can be integrated into **Applicant Tracking Systems (ATS)** to enhance hiring efficiency.  
