<a href="https://colab.research.google.com/github/harshelke180502/Resume_Parser/blob/main/Resume_Parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pdfplumber spacy python-docx nltk
!pip install sentence-transformers
!python -m spacy download en_core_web_sm

In [None]:
from google.colab import files
data = files.upload()


In [None]:
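import pdfplumber
def extract_text_from_pdf(pdf_path):
  text=""
  with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
      # extract_text() can return None or "" for pages without a text layer
      page_text=page.extract_text()
      if page_text:
        text+=page_text+"\n"

  return text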



In [None]:
resume_text=extract_text_from_pdf("/content/Harsh_Shelke (1) (2).pdf")

In [None]:
print(resume_text)

In [None]:
import re
def clean_text(text):
  text=re.sub(r'\s+',' ',text)
  return text.strip()

cleaned_text=clean_text(resume_text)





In [None]:
cleaned_text

In [None]:
import spacy
NLP=spacy.load("en_core_web_sm")
doc=NLP(cleaned_text)
def extract_entity(doc):
  entities={"PERSON":[], "ORG":[], "DATE":[], "GPE":[]}
  for ent in doc.ents:
    if ent.label_ in entities:
      entities[ent.label_].append(ent.text)
  return entities

entities=extract_entity(doc)
entities



In [None]:
def extract_contact_info(text):
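  # [A-Za-z] instead of [A-Z|a-z]: inside a character class, | is a literal pipe
  email_id=re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',text)
  # Only matches the 123-456-7890 format
  phone_number=re.findall(r'\d{3}-\d{3}-\d{4}',text)
  # Dots are escaped so they match a literal "." rather than any character
  linkedin=re.findall(r'linkedin\.com/in/[a-zA-Z0-9-]+',text)
  github=re.findall(r'github\.com/[a-zA-Z0-9-]+',text)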
  return {
      "email_id":email_id[0] if email_id else None,
      "phone_number":phone_number[0] if phone_number else None,
      "linkedin":linkedin[0] if linkedin else None,
      "github":github[0] if github else None
  }

In [None]:
contact=extract_contact_info(cleaned_text)

In [None]:
def extract_section(text,section_name):
  # re.escape keeps any regex metacharacters in the heading literal
  pattern=re.compile(rf"{re.escape(section_name)}(.+?)(?=(Education|Experience|Technical Skills|Projects|$))", re.IGNORECASE | re.DOTALL)
  match=pattern.search(text)
  return match.group(1).strip() if match else None
education_txt=extract_section(cleaned_text,"Education")
experience_txt=extract_section(cleaned_text,"Experience")
skills_txt=extract_section(cleaned_text,"Technical Skills")
projects_txt=extract_section(cleaned_text,"Projects")



In [None]:
skills_txt



In [None]:
experience_txt

In [None]:
projects_txt

In [None]:
parsed_resume = {
    "name": entities["PERSON"][0] if entities["PERSON"] else None,
    "email": contact["email_id"],
    "phone": contact["phone_number"],
    "linkedin": contact["linkedin"],
    "github": contact["github"],
    "education": education_txt,
    "experience": experience_txt,
    "technical_skills": skills_txt,
    "projects": projects_txt

}

import json
print(json.dumps(parsed_resume, indent=4))

In [None]:
!pip install keybert





## 🔧 **Importing Libraries**

```python
import pdfplumber
import re
import spacy
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer, util
```

### 📌 What These Do:

* **`pdfplumber`**: Reads and extracts text from PDF files **page-by-page**.
* **`re`**: Python’s **regular expression** module used for text pattern matching and cleaning.
* **`spacy`**: An NLP library used for **text preprocessing**, named entity recognition (NER), etc.
* **`KeyBERT`**: A keyword extraction model that uses **BERT-style embeddings**.
* **`SentenceTransformer`**: Provides pre-trained transformer models for **sentence embeddings**.
* **`util`**: Utilities for comparing embeddings, like **cosine similarity**.

---

## 🔍 **Model Initialization**

```python
model = SentenceTransformer("all-MiniLM-L6-v2")
nlp = spacy.load("en_core_web_sm")
kw_model = KeyBERT(model='all-MiniLM-L6-v2')
```

### ✅ Purpose:

You're loading **pre-trained models** for different NLP tasks:

* **`SentenceTransformer` model**: Converts any sentence or phrase into a **384-dimensional vector**. Useful for **semantic similarity**.

  ✅ Example:

  ```python
  model.encode("Machine learning with TensorFlow")
  # returns a dense 384-dimensional vector, e.g. [0.03, -0.01, 0.12, ..., 0.06] (values illustrative)
  ```

* **`spacy.load("en_core_web_sm")`**: Loads spaCy's **small English model**, which supports:

  * Tokenization
  * Part-of-speech tagging
  * Named entity recognition (NER)

* **`KeyBERT(...)`**: Uses the same transformer model (`MiniLM`) to extract the **most representative phrases** from any block of text.

---

## 📄 **PDF Text Extraction**

```python
def extract_text_from_pdf(pdf_path):
    text = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + '\n'
    return text.strip()
```

### ✅ What This Does:

* Reads a PDF file **page-by-page**
* Extracts **text content** from each page
* Concatenates all the page texts into one string
* Returns the cleaned, final **document text**
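
The `if page_text` guard matters because `extract_text()` can come back as `None` or an empty string for pages without a text layer (scanned images, for example), depending on the pdfplumber version. Below is a minimal sketch of the same accumulation logic, using a plain list as a stand-in for the per-page results so it runs without a PDF. The `join_page_texts` helper is hypothetical and not part of pdfplumber.

```python
def join_page_texts(page_texts):
    """Join per-page text with newlines, skipping pages that yielded nothing."""
    parts = [t for t in page_texts if t]  # drops both None and ""
    return "\n".join(parts).strip()


# Stand-in for [page.extract_text() for page in pdf.pages]
pages = ["Harsh Shelke\nSkills: Python", None, "", "Projects:\n- Resume Parser"]
combined = join_page_texts(pages)
print(combined)
```

Without the guard, concatenating `None` onto a string raises a `TypeError` on the first empty page.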

### 🧪 Example:

Suppose `resume.pdf` has 2 pages:

* Page 1:

  ```
  Harsh Shelke
  Skills: Python, Flask, TensorFlow
  ```
* Page 2:

  ```
  Projects:
  - Diabetic Retinopathy Detection
  - Tour Package Prediction
  ```

Then calling:

```python
extract_text_from_pdf("resume.pdf")
```

Will return:

```text
"Harsh Shelke\nSkills: Python, Flask, TensorFlow\nProjects:\n- Diabetic Retinopathy Detection\n- Tour Package Prediction"
```

---

## 🧹 **Basic Cleanup Function**

```python
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()
```

### ✅ Purpose:

* Removes **extra spaces**, tabs, or newlines from the text.
* Makes everything into **one uniform block of text** with single spaces between words.

### 🔍 Regex Explanation:

* `\s+`: Matches **one or more** whitespace characters (space, tab, newline).
* `re.sub(..., ' ', ...)`: Replaces them all with a single space.
* `.strip()`: Removes extra space from **start and end** of the text.

### 🧪 Example:

```python
text = "Harsh   Shelke\n\nSkills:    Python,\n\tFlask"
clean_text(text)
```

Returns:

```text
"Harsh Shelke Skills: Python, Flask"
```
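
PDF text often contains whitespace beyond plain spaces and newlines. For Python `str` patterns, `\s` also matches Unicode whitespace such as the non-breaking space (`\u00a0`) and the em space (`\u2003`), so the same function handles those as well. A small sketch, with a made-up input string:

```python
import re


def clean_text(text):
    # Collapse any run of whitespace (including Unicode spaces) to one space
    return re.sub(r'\s+', ' ', text).strip()


raw = "Harsh\u00a0Shelke\u2003|\r\n  Python"
cleaned = clean_text(raw)
print(cleaned)
```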

---

## 📑 **Section Extraction Function**

```python
def extract_section(text, section_name):
    pattern = re.compile(rf'{section_name}(.+?)(?=(Education|Work Experience|Technical Skills|Projects|Certifications|$))',
                         re.IGNORECASE | re.DOTALL)
    match = pattern.search(text)
    return match.group(1).strip() if match else ""
```

### ✅ Purpose:

To **extract the text under a specific section heading** (like `"Projects"`, `"Education"`, etc.) from the resume.

---

### 🔍 Regex Breakdown:

Let’s say `section_name = "Projects"`

```regex
Projects(.+?)(?=(Education|Work Experience|Technical Skills|Projects|Certifications|$))
```

* `Projects`: Starting keyword (passed dynamically)
* `(.+?)`: **Non-greedy match** of all content after the section heading.
* `(?=...)`: **Lookahead** to stop matching when the next section begins:

  * `Education`, `Work Experience`, `Certifications`, etc.

Also uses:

* `re.IGNORECASE`: Case-insensitive matching
* `re.DOTALL`: So `.` can match **newline characters** too.
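
One possible hardening of this pattern is sketched below: the lookahead is built from a list of headings, every heading is passed through `re.escape`, and `\b` word boundaries keep a heading from matching inside a longer word. The names `SECTION_HEADINGS` and `extract_section_safe` are hypothetical, and the heading list is an assumption about what a resume contains.

```python
import re

# Assumed set of headings; adjust to the resumes being parsed
SECTION_HEADINGS = ["Education", "Work Experience", "Technical Skills",
                    "Projects", "Certifications"]


def extract_section_safe(text, section_name, headings=SECTION_HEADINGS):
    # Stop at any *other* heading, or at the end of the text
    others = [re.escape(h) for h in headings
              if h.lower() != section_name.lower()]
    stop = "|".join(others)
    pattern = re.compile(
        rf"\b{re.escape(section_name)}\b(.+?)(?=\b(?:{stop})\b|\Z)",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else ""


sample = ("Education MIT World Peace University, B.Tech CS "
          "Projects - Heart Failure Prediction using XGBoost "
          "- Resume Parser using Python "
          "Certifications AWS Cloud Practitioner")
print(extract_section_safe(sample, "Projects"))
```

This is still a heuristic: because matching is case-insensitive, a heading word that appears in body text (for example "projects" in a sentence) will still end or start a section early.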

---

### 🧪 Example Resume Text:

```text
Education
MIT World Peace University, B.Tech CS

Projects
- Heart Failure Prediction using XGBoost
- Resume Parser using Python

Certifications
AWS Cloud Practitioner
```

Calling:

```python
extract_section(text, "Projects")
```

Returns:

```text
"- Heart Failure Prediction using XGBoost\n- Resume Parser using Python"
```

---

## ✅ Summary Table

| Function                | What It Does                     | Example Input                          | Example Output                |
| ----------------------- | -------------------------------- | -------------------------------------- | ----------------------------- |
| `extract_text_from_pdf` | Extracts text from all PDF pages | PDF file                               | Text content of entire resume |
| `clean_text`            | Removes excessive spacing        | `"Hello\n  World"`                     | `"Hello World"`               |
| `extract_section`       | Gets content under a heading     | `"Projects\n- A\n- B\nCertifications"` | `"- A\n- B"`                  |



In [None]:
import pdfplumber
import re
import spacy
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer, util

# Load models
model = SentenceTransformer("all-MiniLM-L6-v2")
nlp = spacy.load("en_core_web_sm")
kw_model = KeyBERT(model='all-MiniLM-L6-v2')

# ---------- PDF Text Extraction ----------
def extract_text_from_pdf(pdf_path):
    text = ''
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + '\n'
    return text.strip()

# ---------- Basic Cleanup ----------
def clean_text(text):
    return re.sub(r'\s+', ' ', text).strip()

# ---------- Section Extraction ----------
def extract_section(text, section_name):
    pattern = re.compile(rf'{section_name}(.+?)(?=(Education|Work Experience|Technical Skills|Projects|Certifications|$))',
                         re.IGNORECASE | re.DOTALL)
    match = pattern.search(text)
    return match.group(1).strip() if match else ""

In [None]:
text=extract_text_from_pdf("/content/Harsh_Shelke (1) (2).pdf")

In [None]:
text

In [None]:
cleaned_text=clean_text(text)
cleaned_text

In [None]:
extract_section(cleaned_text,"Projects")



---

## 🧠 Part 1: **Automatic Skill Extraction**

```python
def extract_skills_auto(text, top_n=15):
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 3),
        stop_words='english',
        top_n=top_n
    )
    return [kw[0] for kw in keywords]
```

### ✅ Purpose:

To automatically extract the **top `n` most relevant skills or key phrases** from a block of text (typically from a resume or job description) using **KeyBERT**, which is a BERT-based keyword extraction tool.

### ⚙️ How It Works:

* **`kw_model.extract_keywords`** uses BERT-based embeddings to find keywords/phrases that are **semantically representative** of the text.
* **`keyphrase_ngram_range=(1, 3)`**: Extract **1 to 3-word** phrases like:

  * "machine learning"
  * "data analysis"
  * "deep learning model"
* **`stop_words='english'`**: Removes common filler words like “and,” “is,” “the.”
* **`top_n=15`**: Limits the output to the **15 most important phrases**.

### 🧪 Example:

#### Input `text` (from resume):

```
I have experience with Python, machine learning, TensorFlow, Keras, and scikit-learn. I built models for classification, regression, and time series forecasting.
```

#### Output (illustrative; actual KeyBERT output is lowercased and the ranking depends on the model):

```python
['machine learning', 'TensorFlow', 'scikit-learn', 'time series forecasting', 'regression']
```

This is useful for:

* Auto-tagging resumes with **extracted skills**
* Matching resumes to job descriptions
* Showing **strength areas** of a candidate

---

## 🧠 Part 2: **Semantic Match with Job Description**

```python
def calculate_similarity(text1, text2):
    emb1 = model.encode(text1, convert_to_tensor=True)
    emb2 = model.encode(text2, convert_to_tensor=True)
    score = util.pytorch_cos_sim(emb1, emb2).item()
    return round(score * 100, 2)
```

### ✅ Purpose:

To **quantify how similar** two pieces of text are — e.g., a resume and a job description — based on their **semantic meaning**, **not just keywords**.

### ⚙️ How It Works:

* Uses a `SentenceTransformer` model (e.g., MiniLM) to convert both texts into **dense vector embeddings**.
* Then computes **cosine similarity** between them.
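* Scales the score by 100 so it reads as a **percentage**. Cosine similarity ranges from -1 to 1, so the result is not strictly bounded to 0 to 100, although related texts typically score above 0.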
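
To make the cosine step concrete, here is a minimal pure-Python sketch of the same arithmetic on toy vectors. It is a stand-in for `util.pytorch_cos_sim`, not its implementation; the helper names `cosine_similarity` and `to_percent` are hypothetical and the vectors are made up.

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def to_percent(score):
    # Same scaling as calculate_similarity: multiply by 100, round to 2 places
    return round(score * 100, 2)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(to_percent(same_direction), to_percent(orthogonal), to_percent(opposite))
```

Cosine similarity depends only on the angle between the vectors, not their length, which is why `[1, 2, 3]` and `[2, 4, 6]` count as a perfect match.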

### 🧪 Example:

#### Input:

```python
resume_text = "Experienced with React, Flask, and Docker for building full-stack apps."
job_description = "We are hiring a developer with strong knowledge of Flask and API development."
```

#### Output:

```python
calculate_similarity(resume_text, job_description)
# ➜ a float such as 85.67 (illustrative; the actual score depends on the model)
```

This is very helpful for:

* Ranking applicants
* Filtering resumes
* Visualizing job fit scores

---

## 🧠 Part 3: **Degree, Company & College Keyword Lists**

```python
DEGREE_KEYWORDS = [
    "Bachelor", "Bachelors", "B.Tech", "B.E", "BE", "BS", "BSc",
    "Master", "M.Tech", "M.E", "MS", "MSc", "PhD", "Diploma"
]
COMPANY_KEYWORDS = ["Technologies", "Solutions", "Labs", "Systems", "Inc", "LLC", "Ltd", "Corporation"]
COLLEGE_KEYWORDS = ["Institute", "University", "College", "School", "Academy"]
```

### ✅ Purpose:

These are **reference keyword lists** used for rule-based **named entity extraction**, especially when identifying:

1. **Degrees**: Useful for extracting education qualifications from resumes.
2. **Company names**: Helps detect work experience affiliations.
3. **College/institution names**: Helps isolate educational institutions.

### 🧪 Examples:

#### 📘 DEGREE\_KEYWORDS

Resume line:

```
Completed B.Tech in Computer Science from MIT WPU.
```

→ Detected: `"B.Tech"` → **degree**

#### 🏢 COMPANY\_KEYWORDS

Resume line:

```
Worked at Turing Technologies as a backend intern.
```

→ Contains `"Technologies"` → **company**

#### 🏫 COLLEGE\_KEYWORDS

Resume line:

```
Graduated from Stanford University with an MS in AI.
```

→ Detected: `"University"` → **educational institute**

These keyword lists can be used in your parsing logic like:

```python
for word in text.split():
    token = word.strip(",.;:()")  # drop surrounding punctuation
    # Exact match avoids false hits such as "BE" inside "BEST"
    if token in DEGREE_KEYWORDS:
        print("Found a degree:", token)
```

Or with spaCy's NER to **post-process and label entities** more accurately using these keyword hints.
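
As a rough sketch of that post-processing idea, the function below takes organization strings (such as the `ORG` entity texts an NER model might return) and labels each as a college or a company by whole-word keyword match. `classify_org` is a hypothetical helper, and checking colleges before companies is an arbitrary choice made for this example.

```python
import re

COMPANY_KEYWORDS = ["Technologies", "Solutions", "Labs", "Systems",
                    "Inc", "LLC", "Ltd", "Corporation"]
COLLEGE_KEYWORDS = ["Institute", "University", "College", "School", "Academy"]


def classify_org(name):
    # Compare whole words only, so "Schooling" does not match "School"
    words = set(re.findall(r"[A-Za-z]+", name))
    if words & set(COLLEGE_KEYWORDS):
        return "college"
    if words & set(COMPANY_KEYWORDS):
        return "company"
    return "unknown"


for org in ["Stanford University", "Turing Technologies", "Google"]:
    print(org, "->", classify_org(org))
```

Names without any hint word, such as "Google", fall through to `"unknown"`, so this works best as a refinement on top of NER rather than a replacement for it.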

---

## ✅ Final Summary:

| Component                          | Purpose                                          | Example Output                                                    |
| ---------------------------------- | ------------------------------------------------ | ----------------------------------------------------------------- |
| `extract_skills_auto(text)`        | Extract top 15 skill-like phrases                | `['machine learning', 'scikit-learn', 'time series forecasting']` |
| `calculate_similarity(resume, jd)` | Measure match % between resume & job description | `86.23`                                                           |
| `DEGREE_KEYWORDS`, etc.            | Help detect education, companies, and colleges   | `"B.Tech"`, `"Technologies"`, `"University"`                      |

---



In [None]:
# ---------- Automatic Skill Extraction ----------
def extract_skills_auto(text, top_n=15):
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 3), stop_words='english', top_n=top_n)
    return [kw[0] for kw in keywords]

# ---------- Semantic Match with JD ----------
def calculate_similarity(text1, text2):
    emb1 = model.encode(text1, convert_to_tensor=True)
    emb2 = model.encode(text2, convert_to_tensor=True)
    score = util.pytorch_cos_sim(emb1, emb2).item()
    return round(score * 100, 2)

# ---------- DEGREE + ORGANIZATION ENHANCED EXTRACTION ----------
DEGREE_KEYWORDS = [
    "Bachelor", "Bachelors", "B.Tech", "B.E", "BE", "BS", "BSc",
    "Master", "M.Tech", "M.E", "MS", "MSc", "PhD", "Diploma"
]
COMPANY_KEYWORDS = ["Technologies", "Solutions", "Labs", "Systems", "Inc", "LLC", "Ltd", "Corporation"]
COLLEGE_KEYWORDS = ["Institute", "University", "College", "School", "Academy"]



## 🔧 How `kw_model.extract_keywords()` Uses Embeddings

At the heart of **KeyBERT** is the idea of **semantic similarity** — it uses **sentence embeddings** to determine how relevant a candidate keyword is to the whole text.

### ✅ Here’s what happens under the hood:

1. **Step 1: Convert the full input text into an embedding** (using a model like `all-MiniLM-L6-v2`).
2. **Step 2: Generate candidate phrases** (1–3 words) using n-gram extraction.
3. **Step 3: Embed each candidate phrase** the same way.
4. **Step 4: Compute cosine similarity** between the full-text embedding and each phrase embedding.
5. **Step 5: Rank the phrases** by similarity score — highest scoring phrases are most representative.

---

## 🧠 Example Breakdown

### Input:

```text
"I have experience with Python, machine learning, TensorFlow, Keras, and scikit-learn. I built models for classification, regression, and time series forecasting."
```

---

### 🔹 Step 1: Convert the full text into an embedding vector

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
text_embedding = model.encode(text)  # Output: a 384-dimensional vector
```

The embedding might look like:

```python
[0.035, -0.112, 0.208, ..., 0.076]  # size: (384,)
```

This is a **dense semantic representation** of the entire resume text.

---

### 🔹 Step 2: Extract candidate phrases (using n-grams)

From the text, KeyBERT may extract candidates like:

```
["Python", "machine learning", "TensorFlow", "regression", "classification", "time series forecasting", "scikit-learn", "built models"]
```

These are **1- to 3-word phrases** produced by plain n-gram tokenization (no embeddings yet). In practice the candidates are lowercased, and stop words are dropped before the n-grams are formed.
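
By default KeyBERT delegates this step to scikit-learn's `CountVectorizer`. The sketch below is a simplified pure-Python stand-in for it, not the library's code: `candidate_phrases` is a hypothetical helper and `STOP_WORDS` is a tiny illustrative list rather than the full English stop-word list.

```python
import re

# Tiny illustrative stop-word list (the real English list is much longer)
STOP_WORDS = {"i", "have", "with", "and", "for", "the", "a", "of"}


def candidate_phrases(text, ngram_range=(1, 3)):
    # Lowercase, keep tokens of 2+ word characters, then drop stop words
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in STOP_WORDS]
    low, high = ngram_range
    phrases = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase not in phrases:  # keep first occurrence only
                phrases.append(phrase)
    return phrases


candidates = candidate_phrases("Python and machine learning")
print(candidates)
```

Because stop words are removed before the n-grams are built, a phrase such as "python machine" can appear even though the two words were not adjacent in the original sentence.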

---

### 🔹 Step 3: Convert each candidate phrase into its own embedding

```python
phrase = "machine learning"
phrase_embedding = model.encode(phrase)
```

Output (again a 384-dim vector):

```python
[0.051, -0.097, 0.201, ..., 0.089]
```

---

### 🔹 Step 4: Compare phrase embedding to full-text embedding using cosine similarity

```python
from sentence_transformers import util
score = util.cos_sim(phrase_embedding, text_embedding)
```

This yields a 1×1 tensor. The value below is illustrative:

```python
tensor([[0.93]])
```

A score this high would mean that "machine learning" is **very similar** to the main idea of the resume.

---

### 🔹 Step 5: Rank and return top N keywords

KeyBERT sorts all phrases by similarity score and returns:

```python
['machine learning', 'TensorFlow', 'scikit-learn', 'time series forecasting', 'regression']
```
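
Steps 4 and 5 together amount to "score every candidate against the document vector, then sort". A minimal sketch with hand-made 3-dimensional vectors standing in for real 384-dimensional embeddings follows; `rank_phrases` is a hypothetical helper and the numbers are invented for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def rank_phrases(doc_vec, phrase_vecs, top_n=3):
    # Score each phrase against the document, highest similarity first
    scored = [(phrase, cosine(vec, doc_vec))
              for phrase, vec in phrase_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors; real embeddings would come from model.encode(...)
doc_vec = [0.9, 0.4, 0.1]
phrase_vecs = {
    "machine learning": [0.8, 0.5, 0.1],
    "built models": [0.3, 0.2, 0.9],
    "regression": [0.6, 0.7, 0.0],
}

top = rank_phrases(doc_vec, phrase_vecs, top_n=2)
print([phrase for phrase, _ in top])
```

The result is a list of `(phrase, score)` pairs, which is the shape that `extract_skills_auto` relies on when it keeps `kw[0]` from each keyword.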

---

## 🔍 Why Embeddings Matter Here

* Instead of looking for **exact matches** (like regex), embeddings let you **understand meaning**.
* Even if the resume says "time series model" and the job says "forecasting", the **semantic link** is detected.

---

## 🧰 Summary Table

| Stage                | Input                | Output                 |
| -------------------- | -------------------- | ---------------------- |
| Sentence Embedding   | Full resume text     | Dense vector (384 dim) |
| Candidate Extraction | n-gram phrases (1–3) | List of phrases        |
| Phrase Embedding     | Each phrase          | 384-dim vector         |
| Similarity Score     | Phrase vs Text       | Cosine similarity      |
| Final Output         | Ranked phrases       | Top-N relevant skills  |


