# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


In [None]:
#📥 Download the Dataset
!wget https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv

--2025-06-23 13:12:22--  https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 646072 (631K) [text/plain]
Saving to: ‘data.csv’


2025-06-23 13:12:22 (18.6 MB/s) - ‘data.csv’ saved [646072/646072]



### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (see the `wget` cell above).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [None]:
import pandas as pd


df = pd.read_csv('data.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,company,position,url,location,headquaters,employees,founded,industry,Job Description
0,1,Visual BI Solutions Inc,Graduate Intern (Summer 2017) - SAP BI / Big D...,https://www.glassdoor.com/partner/jobListing.h...,"Plano, TX","Plano, TX",51 to 200 employees,2010,Information Technology,"Location: Plano, TX or Oklahoma City, OK Dura..."
1,2,Jobvertise,Digital Marketing Manager,https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Berlin, Germany",1 to 50 employees,2011,Unknown,The Digital Marketing Manager is the front li...
2,3,Santander Consumer USA,"Manager, Pricing Management Information Systems",https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Dallas, TX",5001 to 10000 employees,1995,Finance,Summary of Responsibilities:The Manager Prici...
3,4,Federal Reserve Bank of Dallas,Treasury Services Analyst Internship,https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Dallas, TX",1001 to 5000 employees,1914,Finance,ORGANIZATIONAL SUMMARY: As part of the nati...
4,5,Aviall,"Intern, Sales Analyst",https://www.glassdoor.com/partner/jobListing.h...,"Dallas, TX","Dallas, TX",1001 to 5000 employees,Boeing,Subsidiary or Business Segment,Aviall is the world's largest provider of n...
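
The last row above shows `Boeing` in the `founded` column and `Subsidiary or Business Segment` in `industry`, which suggests some rows have shifted fields. A minimal sketch of a sanity check for that, using only the standard library and an inline sample copied from the rows shown above; the helper name `suspicious_rows` is hypothetical and the "4-digit year" rule is an assumption about what a valid `founded` value looks like:

```python
import csv
import io

# Inline sample: three columns copied from the df.head() output above.
SAMPLE = """company,founded,industry
Visual BI Solutions Inc,2010,Information Technology
Jobvertise,2011,Unknown
Aviall,Boeing,Subsidiary or Business Segment
"""


def suspicious_rows(csv_text):
    """Return companies whose 'founded' value is not a 4-digit number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row in reader:
        founded = row["founded"]
        if not (founded.isdigit() and len(founded) == 4):
            flagged.append(row["company"])
    return flagged


print(suspicious_rows(SAMPLE))
```

Rows flagged this way are worth inspecting before grouping by `industry` in Step 5, since their industry label may not be a real industry.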


### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Tokenizer data and the stopword list used below
nltk.download('punkt_tab')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))


def preprocess_text(text):
    """Lowercase, tokenize, and drop non-alphabetic tokens and stopwords."""
    # Lowercase first so the stopword comparison is case-insensitive
    tokens = word_tokenize(text.lower())
    # Keep purely alphabetic tokens; this drops punctuation, numbers,
    # and any token that contains a digit or symbol
    tokens = [word for word in tokens if word.isalpha()]
    # Remove English stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]
    # Join with a single space (note the space between the quotes)
    return ' '.join(filtered_tokens)


df['Cleaned_Description'] = df['Job Description'].apply(preprocess_text)

print(df[['Job Description', 'Cleaned_Description']].head())




[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                     Job Description  \
0   Location: Plano, TX or Oklahoma City, OK Dura...   
1   The Digital Marketing Manager is the front li...   
2   Summary of Responsibilities:The Manager Prici...   
3   ORGANIZATIONAL SUMMARY:   As part of the nati...   
4     Aviall is the world's largest provider of n...   

                                 Cleaned_Description  
0  location plano tx oklahoma city ok duration in...  
1  digital marketing manager front line patient c...  
2  summary responsibilities manager pricing mis r...  
3  organizational summary part nation central ban...  
4  aviall world largest provider new aviation par...  
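
One side effect of the `isalpha()` filter is that skill names containing digits or symbols are discarded along with the punctuation. The sketch below illustrates this with a dependency-free stand-in: `simple_tokens` is a hypothetical whitespace tokenizer (not NLTK's `word_tokenize`, which may split text differently), and `STOP_WORDS` is a tiny illustrative subset rather than NLTK's list.

```python
import re

# Tiny illustrative subset; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "with"}


def simple_tokens(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    return [tok.strip(".,;:()") for tok in text.lower().split()]


def preprocess_alpha_only(text):
    """Same filtering rule as the notebook: keep alphabetic tokens only."""
    return [t for t in simple_tokens(text)
            if t.isalpha() and t not in STOP_WORDS]


def preprocess_keep_symbols(text):
    """Alternative rule: keep any token that contains at least one letter."""
    return [t for t in simple_tokens(text)
            if re.search(r"[a-z]", t) and t not in STOP_WORDS]


sample = "Experience with C++, Node.js and SQL is required."
print(preprocess_alpha_only(sample))    # "c++" and "node.js" are dropped
print(preprocess_keep_symbols(sample))  # "c++" and "node.js" survive
```

Whether the looser rule is preferable depends on how many symbol-bearing skill names the postings contain; it also lets through tokens such as `covid-19` that are not skills.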


### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect and extract skills from job descriptions.

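1. Load spaCy’s English model.
2. Use NER to pick up named entities (the code below keeps the `ORG` and `LANGUAGE` labels).
3. Extract words related to technical skills, tools, and expertise with a rule-based `Matcher`, since the built-in NER has no skill label.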

In [None]:
import spacy
from spacy.matcher import Matcher

# Load the model
nlp = spacy.load("en_core_web_sm")

# Create a Matcher object and attach it to the vocab
matcher = Matcher(nlp.vocab)

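# Define skill patterns.
# LOWER is compared against the lowercased token text, so a pattern value
# containing capitals or trailing whitespace can never match. As written,
# "Access", "Word", and "GPA " are inert; use "access", "word", and "gpa"
# to enable them.
skill_patterns = [
    [{"LOWER": "analytics"}],
    [{"LOWER": "marketing"}],
    [{"LOWER": "bachelor"}],
    [{"LOWER": "Access"}],
    [{"LOWER": "Word"}],
    [{"LOWER": "sql"}],
    [{"LOWER": "javascript"}],
    [{"LOWER": "GPA "}],
    [{"LOWER": "data"}, {"LOWER": "analysis"}],
]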
matcher.add("SKILLS", skill_patterns)

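# Function to extract skills: NER entities plus Matcher hits, de-duplicated
def extract_combined_skills(text):
    doc = nlp(text)
    # ORG and LANGUAGE entities are a rough proxy for tools and languages;
    # ORG also captures company names
    labeled_entities = [ent.text for ent in doc.ents if ent.label_ in ["ORG", "LANGUAGE"]]
    # Each match is (match_id, start, end); slice the doc to get the span text
    matched_skills = [doc[start:end].text for _, start, end in matcher(doc)]
    # set() removes duplicates, so each skill counts at most once per posting
    return list(set(labeled_entities + matched_skills))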

# Apply to DataFrame
df['Extracted_Skills'] = df['Cleaned_Description'].apply(extract_combined_skills)
print(df[['Job Description', 'Extracted_Skills']].head())



                                     Job Description  \
0   Location: Plano, TX or Oklahoma City, OK Dura...   
1   The Digital Marketing Manager is the front li...   
2   Summary of Responsibilities:The Manager Prici...   
3   ORGANIZATIONAL SUMMARY:   As part of the nati...   
4     Aviall is the world's largest provider of n...   

                                    Extracted_Skills  
0                                  [gpa scores, sql]  
1                               [digital, marketing]  
2                                              [sql]  
3  [marketing, dallas treasury services departmen...  
4                                                 []  
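
To make the case-sensitivity of pattern values concrete, here is a simplified, standard-library imitation of lowercase token matching. It is a sketch only: `find_phrases` is a hypothetical helper, not spaCy's `Matcher`, and it assumes tokens come from a plain whitespace split.

```python
def find_phrases(tokens, patterns):
    """Return each pattern occurrence, comparing lowercased tokens to pattern values."""
    lowered = [t.lower() for t in tokens]
    hits = []
    for pattern in patterns:
        n = len(pattern)
        # Slide a window of the pattern's length across the token list
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == pattern:
                hits.append(" ".join(tokens[i:i + n]))
    return hits


tokens = "Strong SQL and data analysis skills with Access and Word".split()
patterns = [["sql"], ["data", "analysis"], ["Access"], ["word"]]

# ["Access"] finds nothing: the lowercased token "access" != "Access".
print(find_phrases(tokens, patterns))
```

The same reasoning applies to a value with trailing whitespace such as `"GPA "`: no single token ends in a space, so the comparison always fails.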


### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.

In [None]:
from nltk.probability import FreqDist

all_skills = [skill for skills in df['Extracted_Skills'] for skill in skills]

freq_dist = FreqDist(all_skills)


top_10 = freq_dist.most_common(10)
for skill, freq in top_10:
    print(f"{skill}: {freq} times")



marketing: 44 times
microsoft: 34 times
sql: 33 times
microsoft office: 23 times
data analysis: 15 times
english: 10 times
deloitte: 9 times
javascript: 8 times
texas usa: 7 times
ibm: 7 times
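
Because each posting's skills are de-duplicated in Step 3, these counts are the number of postings that mention a skill, and entries such as `texas usa` or `deloitte` are named entities rather than skills. A minimal sketch of filtering such entries and reporting shares of postings, using `collections.Counter` (which NLTK's `FreqDist` extends); the sample lists and the `NOT_SKILLS` blocklist are hypothetical:

```python
from collections import Counter

# Hypothetical per-posting skill lists, shaped like df['Extracted_Skills']
skills_per_posting = [
    ["sql", "marketing"],
    ["marketing", "texas usa"],
    ["sql", "microsoft office"],
    [],
]

# Hypothetical blocklist of extracted entities that are not skills
NOT_SKILLS = {"texas usa"}

counts = Counter(
    skill
    for skills in skills_per_posting
    for skill in set(skills)
    if skill not in NOT_SKILLS
)

# Share of postings mentioning each skill
total = len(skills_per_posting)
shares = {skill: count / total for skill, count in counts.most_common()}

for skill, share in shares.items():
    print(f"{skill}: {share:.0%} of postings")
```

Shares make industries of different sizes comparable, which raw counts do not.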


### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare IT vs. Marketing vs. Healthcare, etc.

In [None]:
from nltk.probability import FreqDist

industry_skill_freqs = {}

for industry, group in df.groupby('industry'):
    # Flatten this industry's per-posting skill lists into one list
    industry_skills = [skill for skills in group['Extracted_Skills'] for skill in skills]
    industry_skill_freqs[industry] = FreqDist(industry_skills)


for industry in ["Information Technology", "Finance", "Subsidiary or Business Segment"]:
    print(f"\nTop skills in {industry}:")
    for skill, freq in industry_skill_freqs[industry].most_common(5):
        print(f"  {skill}: {freq}")




Top skills in Information Technology:
  microsoft: 7
  sql: 6
  texas usa: 6
  marketing: 6
  microsoft office: 2

Top skills in Finance:
  marketing: 10
  sql: 7
  dallas treasury services department regularly apply analytical problem: 3
  federal reserve bank: 3
  data analysis: 3

Top skills in Subsidiary or Business Segment:
  microsoft office: 6
  marketing: 3
  gpa education college coursework: 2
  javascript: 1
  sql: 1
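
Looking up an industry that is absent from the data (for example, if no posting is labeled Healthcare) raises a `KeyError` with a plain dictionary. The sketch below shows the same grouping with the standard library and a lookup that returns an empty result instead; the sample rows and the `top_skills` helper are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (industry, extracted skills) pairs
rows = [
    ("Information Technology", ["sql", "microsoft"]),
    ("Finance", ["sql", "marketing"]),
    ("Finance", ["marketing"]),
]

# One Counter per industry, created on first use
by_industry = defaultdict(Counter)
for industry, skills in rows:
    by_industry[industry].update(skills)


def top_skills(industry, n=5):
    """Return the n most common skills, or an empty list for unknown industries."""
    # .get avoids a KeyError and does not add an empty entry to the defaultdict
    return by_industry.get(industry, Counter()).most_common(n)


print(top_skills("Finance"))
print(top_skills("Healthcare"))
```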


In [3]:
# Uppercase all strings in a list
names = ["alice", "bob", "charlie"]

# Loop version: print each name in uppercase
for name in names:
    print(name.upper())

# List-comprehension version: build a new list
upper_case_names = [name.upper() for name in names]

print(upper_case_names)



['ALICE', 'BOB', 'CHARLIE']
