## An In-depth Exploration of US Job Market by analyzing LinkedIn Job Postings using Natural Language Processing Techniques

## Data ingestion and preprocesing
We firstly need to read the webscraped dataset and explore its contents before we begin the analysis.

In [4]:
# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import re

Matplotlib is building the font cache; this may take a moment.


In [6]:
# read the dataset
data = pd.read_csv('/Users/juliabarsow/Desktop/thesis/project_code/postings.csv')

In [10]:
data.shape

(123849, 31)

In [11]:
#check how many descriptions are missing as this is the most important column
print("Missing rows of description: ",data['description'].isnull().sum())

#drop rows with missing descriptions
data = data.dropna(subset=['description'])

Missing rows of description:  7


In [35]:
list(data['title'].unique())

['Marketing Coordinator',
 'Mental Health Therapist/Counselor',
 'Assitant Restaurant Manager',
 'Senior Elder Law / Trusts and Estates Associate Attorney',
 ' Service Technician',
 'Economic Development and Planning Intern',
 'Producer',
 'Building Engineer',
 'Respiratory Therapist',
 'Worship Leader',
 'Inside Customer Service Associate',
 'Project Architect',
 "Appalachian Highlands Women's Business Center",
 'Structural Engineer',
 'Senior Product Marketing Manager',
 'Osteogenic Loading Coach',
 'Administrative Coordinator',
 'Customer Service / Reservationist',
 'Content Writer, Communications',
 'Controller',
 'Physician Assistant',
 'Licensed Acupuncturist',
 'Software Engineer',
 'Sheet Metal Fabricator',
 'Personal Injury Attorney',
 'NPE 2024 Exhibition Event Worker',
 'Loan Coordinator',
 'General Laborer',
 'Swim Instructor',
 'Administrative Assistant',
 'Service / Construction Technician',
 'Legal Secretary',
 'Salesperson',
 'Registered Nurse',
 'Marketing & Office Coo

In [42]:
# List of titles to count
titles_to_count = [
    'Full Stack Engineer',
    'Computer Scientist',
    'Front end specialist',
    'Project Engineer',
    'Data Architect',
    'Project Manager',
    'Java architect / Lead Java developer',
    'Enterprise Data & Analytics Infrastructure Manager',
    'Senior Software Engineer',
    'Web Developer',
    'Software Implementation Program Manager',
    'Vice President - Engineering & Production Operation',
    'Test Engineer',
    'Sr Software Engineer',
    'IT QA Engineer II',
    'Sr. Business Analyst/Tester',
    'Senior Developer – React Native',
    'Cloud DevOps Engineer',
    'Senior Analyst, Data & Analytics',
    'Senior Business Analyst',
    'Engineering Project Manager / Project Manager',
    'Java full Stack Engineer'
]

# Count occurrences of each title
title_counts = data['title'].value_counts()

# Filter counts for the specified titles
filtered_counts = title_counts[title_counts.index.isin(titles_to_count)]

# Display the counts
print(filtered_counts)

title
Project Manager                                        354
Senior Software Engineer                               162
Project Engineer                                       102
Full Stack Engineer                                     57
Web Developer                                           43
Senior Business Analyst                                 34
Data Architect                                          27
Test Engineer                                           24
Sr Software Engineer                                     6
Cloud DevOps Engineer                                    3
Senior Developer – React Native                          3
Software Implementation Program Manager                  2
Computer Scientist                                       1
Front end specialist                                     1
Vice President - Engineering & Production Operation      1
Java architect / Lead Java developer                     1
Enterprise Data & Analytics Infrastructure Manager

In [12]:
# ✅ Calculate missing values, available count, and percentage
missing_values = data.isnull().sum().to_frame(name='missing_count')
missing_values['available_count'] = len(data) - missing_values['missing_count']
missing_values['missing_percentage'] = (missing_values['missing_count'] / len(data)) * 100  # Keep as float

# ✅ Reorder columns for better readability (available_count first)
missing_values = missing_values[['available_count', 'missing_count', 'missing_percentage']]

# ✅ Sort by missing percentage (ascending for better features on top)
missing_values = missing_values.sort_values(by='missing_percentage', ascending=True)

# ✅ Reset index and rename it to "column_name"
missing_values = missing_values.reset_index().rename(columns={'index': 'column_name'})

# ✅ Apply clear and meaningful color formatting
styled_missing_values = (
    missing_values.style
    .background_gradient(subset=['available_count'], cmap='Greens')  # More available → Green (good)
    .background_gradient(subset=['missing_count'], cmap='Oranges_r')  # More missing → Darker Orange (bad)
    .background_gradient(subset=['missing_percentage'], cmap='Reds_r')  # Higher missing % → Darker Red (bad)
    .format({'missing_percentage': "{:.2f}%"})  # Format as percentage AFTER styling
)

# ✅ Display dataset size
print(f"The dataset size: {data.shape[0]} rows")

# ✅ Display missing values table with improved color usage
display(styled_missing_values)

The dataset size: 123842 rows


Unnamed: 0,column_name,available_count,missing_count,missing_percentage
0,job_id,123842,0,0.00%
1,work_type,123842,0,0.00%
2,sponsored,123842,0,0.00%
3,listed_time,123842,0,0.00%
4,expiry,123842,0,0.00%
5,application_type,123842,0,0.00%
6,original_listed_time,123842,0,0.00%
7,formatted_work_type,123842,0,0.00%
8,job_posting_url,123842,0,0.00%
9,title,123842,0,0.00%


As we can see, there's a significant amount of missing data, however we will drop columns for every usecase we have

preprocessing to perform:
1. lowercasing
2. noise removal -> removing punctuations, emoticon, hashtags, accent marks or diacritics, extra white spaces, special characters, digits (could be useful for sentiment analysis though!!!!)
3. stop word removal -> removing stop words, sparse terms, and particular words. You can use already existing stop words lists or you can create a custom one for your use case.
4. tokenization -> breaking it down into smaller, minimal, meaningful units to work with. It enables to analyze each element in context of the other elements
5. lemmatization/stemming -> reduce the words to their root forms, reducing the bias introduced by the inflection, Lemmatization transforms words to the actual root. You have to know the POS of the word to get the correct lemma. The root form, in the stemming case, is a truncated one: stemming is a process that chops off the ends of words.
6. token enrichment -> POS tagging -> gives a mark to words based on the part-of speech they are, such as nouns, verbs and adjectives

## Skill Extraction and Clustering

### Objective: Extract required skills from job descriptions and cluster them to identify common skill sets across industries.

NLP Techniques: Named Entity Recognition (NER), Topic Modeling, or Clustering algorithms.

Research Questions:
○ What are the most in-demand skills across different sectors?
○ How do skill requirements differ by salary range or job title?
○ What are the salary ranges in different sectors and job positions?


## Preprocessing the dataset for skill extraction

In [None]:
# Load the small English spaCy model.
# This model includes lemmatization, stop words, and tokenization capabilities.
try:
    nlp = spacy.load("en_core_web_trf")
except OSError:
    print("Downloading 'en_core_web_trf' model...")
    from spacy.cli import download
    download("en_core_web_trf")
    nlp = spacy.load("en_core_web_trf")

Downloading 'en_core_web_sm' model...
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m29.5 MB/s[0m  [33m0:00:00[0meta [36m0:00:01[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
def preprocess_text(text: str) -> str:
    """
    Performs a series of text preprocessing steps:
    1. Removes special characters and numbers.
    2. Converts text to lowercase.
    3. Tokenizes the text.
    4. Removes stop words.
    5. Lemmatizes the tokens.

    Args:
        text: The raw text string to be preprocessed.

    Returns:
        The preprocessed and cleaned text string.
    """
    # Check if the input is a valid string. If not, return an empty string.
    if not isinstance(text, str):
        return ""

    # 1. Remove special characters, punctuation, and numbers.
    # We'll keep spaces and letters.
    text = re.sub(r'[^a-zA-Z\s]', '', text, re.I|re.A)

    # Convert to a spaCy Doc object for efficient processing.
    doc = nlp(text)

    # 2. Convert to lowercase, remove leading/trailing whitespace, and perform. Note - conversion to lowercase is handled implicitly by spaCy during lemmatization
    #    stop word removal and lemmatization.
    # A list comprehension is used for efficiency.
    tokens = [
        token.lemma_ for token in doc
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]

    # Join the processed tokens back into a single string.
    return " ".join(tokens)


In [None]:
def preprocess_job_descriptions(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Applies the text preprocessing function to a specified column in a DataFrame.

    Args:
        df: The pandas DataFrame containing the raw data.
        column_name: The name of the column with the job description text.

    Returns:
        A new DataFrame with a 'preprocessed_text' column.
    """
    # Ensure the specified column exists in the DataFrame.
    if column_name not in df.columns:
        print(f"Error: Column '{column_name}' not found in DataFrame.")
        return df

    # Apply the preprocess_text function to each entry in the column.
    print("Preprocessing job descriptions...")
    data = df.copy()
    data['preprocessed_text'] = data[column_name].apply(preprocess_text)
    print("Preprocessing complete!")

    return data

In [70]:
# making a copy of the dataset for further processing
# Random Sampling
# The dataframe contains 123,849 rows, embedding all rows will lead to excessive computational cost for this demo. We will select 10000 rows for job postings.
df = data.sample(10000, random_state=42).copy()
# keeping only the relevant columns for skill extraction
df = df[['job_id', 'description']]

print("Original DataFrame:")
print(df)
print("\n" + "-"*30 + "\n")

# Preprocess the 'description' column.
preprocessed_df = preprocess_job_descriptions(df, 'description')

# Print the resulting DataFrame to see the preprocessed text.
print("Preprocessed DataFrame:")
print(preprocessed_df)

Original DataFrame:
            job_id                                        description
73457   3902931205  Warehouse Associate\n\nSince 2022, Associated ...
31563   3894855473  Interested in joining our influencer talent ma...
41620   3899525675  Talent Specialist - Hybrid Role Must be locate...
10762   3887106339  The ideal Salesperson is passionate about fash...
30334   3894555810  1126363_RR00090444 Job ID: 1126363_RR00090444\...
...            ...                                                ...
106731  3905328899  Job Description\n\n INTEGRIS Health Baptist Me...
29871   3894540767  About Raise Commercial Real Estate\n\nRaise is...
50360   3901372681  Job SummaryThe Paid Ads Manager will play an i...
84824   3904362789  Mechanical Engineer – Design Support (National...
51656   3901391118  Do you want to work for the global leader in t...

[10000 rows x 2 columns]

------------------------------

Preprocessing job descriptions...


  text = re.sub(r'[^a-zA-Z\s]', '', text, re.I|re.A)


KeyboardInterrupt: 

## Skill extraction
We perform skill extraction using a pre-trained SpaCy model and we will later finetune it on our own curated labeled data to enhance accuracy

In [17]:
def extract_skills(text: str) -> list:
    """
    Extracts entities from text using a pre-trained spaCy NER model.

    Args:
        text: The raw or preprocessed job description text.

    Returns:
        A list of extracted strings that are identified as potential skills.
    """
    if not isinstance(text, str):
        return []

    # Process the text with the spaCy NLP pipeline
    doc = nlp(text)

    # Simple filtering for potential skills.
    # This is a heuristic approach, as 'en_core_web_sm' doesn't have a 'SKILL' label.
    # We will primarily look for common nouns or noun phrases that might represent skills.
    
    # You can also iterate through `doc.ents` for a more explicit list of entities
    # recognized by the default model (e.g., ORG, GPE, DATE).
    
    # For a more robust approach, you'll need to create a list of skills to match
    # or fine-tune a model as discussed previously.
    
    # A simple example: extracting proper nouns and compound nouns
    skills = []
    for chunk in doc.noun_chunks:
        # Filter for phrases that are likely to be skills
        # This is a very basic filter; it will need to be refined.
        if "data" in chunk.text.lower() or "learning" in chunk.text.lower(): # TODO: Expand this list - make filter more comprehensive
            skills.append(chunk.text.strip())

    return list(set(skills)) # Return unique skills

In [None]:
def process_dataframe_for_skills(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    """
    Applies skill extraction to a DataFrame column.

    Args:
        df: The pandas DataFrame.
        text_column: The name of the column containing the job description text.

    Returns:
        A DataFrame with a new column 'extracted_skills'.
    """
    print("Starting skill extraction...")
    data = df.copy()
    data['extracted_skills'] = data[text_column].apply(extract_skills)
    print("Skill extraction complete!")
    return data

In [19]:
# Apply the skill extraction function.
extracted_skills_df = process_dataframe_for_skills(preprocessed_df, 'description')

print("DataFrame with Extracted Skills:")
print(extracted_skills_df)

Starting skill extraction...
Skill extraction complete!
DataFrame with Extracted Skills:
            job_id                                        description  \
73457   3902931205  Warehouse Associate\n\nSince 2022, Associated ...   
31563   3894855473  Interested in joining our influencer talent ma...   
41620   3899525675  Talent Specialist - Hybrid Role Must be locate...   
10762   3887106339  The ideal Salesperson is passionate about fash...   
30334   3894555810  1126363_RR00090444 Job ID: 1126363_RR00090444\...   
...            ...                                                ...   
106731  3905328899  Job Description\n\n INTEGRIS Health Baptist Me...   
29871   3894540767  About Raise Commercial Real Estate\n\nRaise is...   
50360   3901372681  Job SummaryThe Paid Ads Manager will play an i...   
84824   3904362789  Mechanical Engineer – Design Support (National...   
51656   3901391118  Do you want to work for the global leader in t...   

                                  

### As we can see the model without any finetuning is not extracting skills well. We will begin the finetuning process by preparing a custom training dataset with the labels we want to define.

In [None]:
# preprocessed_df.to_csv('preprocessed_job_postings.csv', index=False)
preprocessed_df[['job_id', 'preprocessed_text']].to_csv('preprocessed_job_postings_transformer.csv', index=False)

In [43]:
preprocessed_df.columns

Index(['job_id', 'description', 'preprocessed_text', 'extracted_skills'], dtype='object')

In [44]:
extracted_skills_df.columns

Index(['job_id', 'description', 'preprocessed_text', 'extracted_skills'], dtype='object')

## Finetuning the SpaCy NLP model

In [52]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
# import random
from sklearn.model_selection import train_test_split

# Step 1: Prepare the training data
# This is a small sample. For your thesis, you will need a much larger dataset.
# The format is a tuple: (text, {"entities": [(start_char, end_char, "LABEL")]})
# "start_char" and "end_char" are the character indices of the entity in the text.
train_data = [
    ("We are looking for a data scientist with expertise in Python.", {"entities": [(47, 53, "SKILL")]}),
    ("Experience with cloud computing platforms like Amazon Web Services (AWS) is a plus.", {"entities": [(42, 63, "SKILL"), (65, 68, "SKILL")]}),
    ("Needs strong project management skills.", {"entities": [(11, 29, "SKILL")]}),
    ("Seeking a software engineer with strong programming skills in Java and C++.", {"entities": [(49, 53, "SKILL"), (58, 61, "SKILL")]}),
    ("Knowledge of Google Cloud is a plus.", {"entities": [(13, 25, "SKILL")]}),
    ("Proficiency in JavaScript, HTML, and CSS is required.", {"entities": [(16, 26, "SKILL"), (28, 32, "SKILL"), (38, 41, "SKILL")]}),
    ("Familiarity with React and Angular frameworks is a bonus.", {"entities": [(20, 25, "SKILL"), (30, 37, "SKILL")]}),
]

# Split the data into training and development sets
train_data, dev_data = train_test_split(train_data, test_size=0.2, random_state=42)

# Step 2: Convert data to spaCy's format
# This is an efficient way to store your data for training.
def convert_data_to_docbin(data, output_path):
    """Converts training data to a spaCy DocBin file."""

    nlp = spacy.blank("en")
    # ^ Creates a blank English NLP pipeline. We don't need a full model 
    # here (like en_core_web_lg), just the tokenization rules.

    db = DocBin()
    # ^ Initializes the empty DocBin container where all the converted 
    # documents will be stored.

    for text, annotations in tqdm(data, desc="Creating DocBin"):
        # `data` is your list of (text, annotations) tuples.
        # `tqdm` just adds a progress bar (the 'Creating DocBin' part)
        
        doc = nlp.make_doc(text)
        ents = []
        for start, end, label in annotations['entities']:
            span = doc.char_span(start, end, label=label)
            # ^ CRUCIAL STEP: This creates a Span object for the entity. 
            # It uses the start/end *character indices* to find the corresponding 
            # *tokens* in the `doc` and assigns the label ("SKILL").

            if span is not None:
                ents.append(span)
                # ^ If spaCy successfully created the span (meaning the indices 
                # correctly align with token boundaries), add it to the list.

        doc.ents = ents
        # ^ Assigns the newly created list of entities (`ents`) to the document.

        db.add(doc)
        # ^ Adds the fully annotated document (`doc`) to the DocBin container (`db`).

    db.to_disk(output_path)
    # ^ Saves the entire DocBin container to the specified file path (e.g., "train.spacy").

# Run the function to create your training data file
convert_data_to_docbin(train_data, "train.spacy")
convert_data_to_docbin(dev_data, "dev.spacy")

Creating DocBin: 100%|██████████| 5/5 [00:00<00:00, 1365.78it/s]
Creating DocBin: 100%|██████████| 2/2 [00:00<00:00, 2885.66it/s]


## Step 3: Train the model
You will need to run the following command in your terminal to create the config file.
spacy init config --lang en --pipeline ner --optimize efficiency config.cfg

Then, you can run the training from the command line:
spacy train config.cfg --output ./output --paths.train ./train.spacy --verbose

In [53]:
# Step 3: Train the model
# You will need to run the following command in your terminal to create the config file or run this cell in notebook
# spacy init config --lang en --pipeline ner --optimize efficiency config.cfg

!python -m spacy init config --lang en --pipeline ner --optimize efficiency config.cfg


[38;5;1m✘ The provided output file already exists. To force overwriting the
config file, set the --force or -F flag.[0m



In [62]:
# run the training from the command line:
# spacy train config.cfg --output ./output --paths.train ./train.spacy --verbose
# or run this notebook cell

!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --verbose

[2025-10-01 13:00:02,329] [DEBUG] Config overrides from CLI: ['paths.train']
[38;5;2m✔ Created output directory: output[0m
[38;5;4mℹ Saving to output directory: output[0m
[38;5;4mℹ Using CPU[0m
[1m
[2025-10-01 13:00:02,500] [INFO] Set up nlp object from config
[2025-10-01 13:00:02,508] [DEBUG] Loading corpus from path: dev.spacy
[2025-10-01 13:00:02,508] [DEBUG] Loading corpus from path: train.spacy
[2025-10-01 13:00:02,509] [INFO] Pipeline: ['tok2vec', 'ner']
[2025-10-01 13:00:02,511] [INFO] Created vocabulary
[2025-10-01 13:00:02,511] [INFO] Finished initializing nlp object

Load the table in your config with:

[initialize.lookups]
@misc = "spacy.LookupsDataLoader.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]

[2025-10-01 13:00:02,632] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
[38;5;2m✔ Initialized pipeline[0m
[1m
[2025-10-01 13:00:02,641] [DEBUG] Loading corpus from path: dev.spacy
[2025-10-01 13:00:02,642] [DEBUG] Loading corpus from path: train.spacy


In [55]:
# Step 4: Use the newly fine-tuned model
# After training, your new model will be in the 'output/model-best' directory.
# You can load it and use it to extract skills.
def test_fine_tuned_model(model_path, new_text):
    """Loads a trained spaCy model and tests it on new text."""
    try:
        nlp_fine_tuned = spacy.load(model_path)
        print(f"\nSuccessfully loaded fine-tuned model from '{model_path}'")
        doc = nlp_fine_tuned(new_text)
        print(f"\nText: {new_text}")
        print("Extracted Skills:")
        for ent in doc.ents:
            if ent.label_ == "SKILL":
                print(f"- {ent.text} (Label: {ent.label_})")
    except OSError:
        print(f"\nError: Model not found at '{model_path}'.")
        print("Please run 'spacy init config' and 'spacy train' as instructed in the comments.")

In [56]:
test_fine_tuned_model("./output/model-best", "I am a skilled web developer with experience in Vue.js and Svelte.")


Successfully loaded fine-tuned model from './output/model-best'

Text: I am a skilled web developer with experience in Vue.js and Svelte.
Extracted Skills:


In [58]:
MODEL_PATH = "./output/model-best"

def load_skill_extractor(model_path: str):
    """
    Loads the trained spaCy model. If the fine-tuned model is not available, 
    it falls back to a general pre-trained model for demonstration.
    """
    try:
        # 1. Try to load the custom fine-tuned model
        nlp = spacy.load(model_path)
        print(f"Successfully loaded fine-tuned model from '{model_path}'.")
        return nlp
    except OSError:
        # 2. Fallback if the custom model hasn't been trained yet
        print(f"Custom model not found at '{model_path}'.")
        print("Falling back to 'en_core_web_lg' for general entity recognition.")
        try:
            nlp = spacy.load("en_core_web_lg")
        except OSError:
            # Download if the large model is missing
            print("Downloading 'en_core_web_lg' model. This may take a moment...")
            from spacy.cli import download
            download("en_core_web_lg")
            nlp = spacy.load("en_core_web_lg")
        return nlp

In [59]:
def extract_skills(nlp_model, text: str) -> list:
    """
    Extracts entities labeled as 'SKILL' from text using the loaded spaCy model.
    If the model is the fallback 'en_core_web_lg', it will extract general entities 
    (like PERSON, ORG) which might contain some skills, but won't use the SKILL label.
    """
    if not isinstance(text, str):
        return []

    doc = nlp_model(text)
    skills = []

    # Check if the 'SKILL' label exists in the model's pipeline
    is_fine_tuned = nlp_model.get_pipe("ner").has_label("SKILL") if "ner" in nlp_model.pipe_names else False
    
    for ent in doc.ents:
        if is_fine_tuned and ent.label_ == "SKILL":
            # Use the custom SKILL label from your fine-tuned model
            skills.append(ent.text)
        elif not is_fine_tuned and (ent.label_ in ['ORG', 'PRODUCT', 'MISC']):
            # If using the fallback model, look at general entities that often 
            # catch technology (e.g., Python, AWS are sometimes classified as PRODUCT/ORG/MISC)
            skills.append(ent.text)
        
    return sorted(list(set(skills)))

In [60]:
def process_dataframe_for_skills(df: pd.DataFrame, text_column: str, nlp_model) -> pd.DataFrame:
    """
    Applies the skill extraction function to a DataFrame column.
    """
    if text_column not in df.columns:
        print(f"Error: Column '{text_column}' not found in DataFrame.")
        return df

    print("Starting skill extraction...")
    # Use a lambda function to pass the model to the extractor
    df['extracted_skills'] = df[text_column].apply(lambda x: extract_skills(nlp_model, x))
    print("Skill extraction complete!")
    return df


In [63]:
nlp_skill_extractor = load_skill_extractor(MODEL_PATH)
    
sample_data = {
    'job_description': [
        "We are looking for a data scientist with expertise in Python, machine learning, and SQL databases. Experience with cloud computing platforms like Amazon Web Services (AWS) is a plus. Needs strong project management skills.",
        "Seeking a software engineer with strong programming skills in Java and C++. Needs a background in web development and agile methodologies. Knowledge of Google Cloud is a plus."
    ]
}
sample_df = pd.DataFrame(sample_data)

print("\n" + "-"*30)
print("Sample Data Processing")
print("-"*30)

# 3. Apply the skill extraction function.
extracted_skills_df = process_dataframe_for_skills(sample_df, 'job_description', nlp_skill_extractor)

print("\nDataFrame with Extracted Skills:")
print(extracted_skills_df[['job_description', 'extracted_skills']])

Successfully loaded fine-tuned model from './output/model-best'.

------------------------------
Sample Data Processing
------------------------------
Starting skill extraction...


AttributeError: 'spacy.pipeline.ner.EntityRecognizer' object has no attribute 'has_label'

## Labeling our original data
The model did not find any skills from the sample dataset of 7 rows, so we will label our original dataset of 1000 entries to train the model.
If additional training data won't result in finding the labels successfully we will use a transformer

In [65]:
# Install label-studio open source platform for data labeling
!python -m pip install label-studio



## Start label-studio in terminal using this command:
label-studio start

In [None]:
# Start Label Studio in terminal:
# label-studio start

/Users/juliabarsow/Desktop/thesis/project_code/NLP_Job_Postings/.venv/bin/python: No module named label-studio


## Trial and errors

In [29]:
# undo the pd set option display max colwidth option
pd.set_option('display.max_colwidth', 50)
# print first 5 fows of description form bot df and preprocessed_df
print("Original Descriptions (first 5 rows):")
print(df['description'].head())
print("\nPreprocessed Descriptions (first 5 rows):")
print(preprocessed_df['preprocessed_text'].head())

Original Descriptions (first 5 rows):
73457    Warehouse Associate\n\nSince 2022, Associated ...
31563    Interested in joining our influencer talent ma...
41620    Talent Specialist - Hybrid Role Must be locate...
10762    The ideal Salesperson is passionate about fash...
30334    1126363_RR00090444 Job ID: 1126363_RR00090444\...
Name: description, dtype: object

Preprocessed Descriptions (first 5 rows):
73457    Warehouse Associate Associated Materials Alsid...
31563    interested join influencer talent management t...
41620    Talent Specialist Hybrid Role locate Greenvill...
10762    ideal Salesperson passionate fashion styling a...
30334    RR Job ID RR NYU Langone HospitalLong Island b...
Name: preprocessed_text, dtype: object


In [11]:
#loading the spacy model (could give explanation why i chose en_core_web_sm model)
nlp = spacy.load("en_core_web_sm")

In [12]:
# removing the unnecessary columns
columns = ['job_id', 'title', 'description']
data1 = data[columns]
data1.shape

(123849, 3)

In [13]:
data1['title'].value_counts()

title
Sales Manager                                       673
Customer Service Representative                     373
Project Manager                                     354
Administrative Assistant                            254
Senior Accountant                                   238
                                                   ... 
Cath Lab / IR Technologist (Cert) - Cardiac Cath      1
Energy Administrative Assistant Part Time             1
ASSOCIATE CLIENT SUCCESS MANAGER                      1
Student Nurse - Telemetry                             1
Marketing Social Media Specialist                     1
Name: count, Length: 72521, dtype: int64

In [14]:
data1.head()

Unnamed: 0,job_id,title,description
0,921716,Marketing Coordinator,Job descriptionA leading real estate firm in N...
1,1829192,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ..."
2,10998357,Assitant Restaurant Manager,The National Exemplar is accepting application...
3,23221523,Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...
4,35982263,Service Technician,Looking for HVAC service tech with experience ...


In [15]:
# Sampling 5 data points
job_descriptions = data['description'].dropna().sample(5).tolist()
# Preprocessing function
def preprocess(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    doc = nlp(text)
    # tokenization and lemmatization (ex. running -> run)
    tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop and token.pos_ in {"NOUN", "PROPN"}]
    return tokens

In [16]:
# Apply preprocessing
processed_descriptions = [preprocess(desc) for desc in job_descriptions]

# Show results
for i, (raw, processed) in enumerate(zip(job_descriptions, processed_descriptions)):
    print(f"--- Job Description {i+1} ---")
    print("Original:", raw)
    print("------------------------------------------------------------------")
    print("Processed:", processed)
    print()

--- Job Description 1 ---
Original: Join an amazing team that is consistently recognized for our achievements and culture, including our most recent Forbes award of being one of America's Best Midsize Employers for 2024!

Position Summary

If you’re passionate about helping people restore their lives when the unexpected happens to their homes and providing the best customer experience, then our Mercury Insurance Property Claims team could be the place for you!

Upon completion of the training program, ideal candidates will transition into a property claims field adjusting position traveling to loss sites that have been damaged by fire, water, weather, or other unexpected events. You may also handle some claims via virtual technology and/or collaborate with vendors.

The Property Claims Field Adjuster ll will learn apply knowledge of current Company policies, applicable regulatory standards, and procedures to investigate, evaluate and settle moderate Homeowner's property claims in a tim

In [17]:
import re
from spacy import displacy

# Sample 5 job descriptions and drop NaN values
job_descriptions = data['description'].dropna().sample(5).tolist()

# Join the list into a single string
text = " ".join(job_descriptions).lower()  # Lowercase the combined text

# Remove punctuation and numbers
text = re.sub(r'[^a-zA-Z\s]', '', text)

# Process the cleaned text with spaCy
doc = nlp(text)

# Display named entities (if any)
displacy.render(doc, style="ent")


In [18]:
processed_descriptions = [preprocess(desc) for desc in job_descriptions]

In [19]:
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]
df = pd.DataFrame(entities, columns=['text', 'type', 'lemma'])
print(df)

                      text     type                  lemma
0                    third  ORDINAL                  third
1                one years     DATE               one year
2             twelve hours     TIME            twelve hour
3                 annually     DATE               annually
4                   weekly     DATE                 weekly
5                  monthly     DATE                monthly
6                    years     DATE                   year
7            massachusetts      GPE          massachusetts
8                   annual     DATE                 annual
9     ten to fifteen years     DATE    ten to fifteen year
10                     iep      ORG                    iep
11                one year     DATE               one year
12  at least  years of age     DATE  at least  year of age
13                  weekly     DATE                 weekly
14        maxim healthcare      ORG       maxim healthcare
15               a century     DATE              a centu

### NEXT STEPS

1. Skill Extraction - NER model for skills
Use a model trained to recognize skills (e.g., SpaCy custom NER, or libraries like SkillNer, Pyresparser, or ESCO-based tools).

2. Associate Skills with Job Titles

example:
skills_by_job = {}

for title, description in zip(job_titles, job_descriptions):
    doc = nlp(description.lower())
    tokens = [token.lemma_ for token in doc if token.lemma_ in skill_set]
    skills_by_job[title] = tokens


3. Clustering Skills into Categories - Automated clustering with embeddings (advanced)
Use word embeddings (like word2vec, spaCy, or sentence-transformers) + KMeans to group similar skills.

example:

from sklearn.cluster import KMeans
import numpy as np

--- Get unique skills
unique_skills = list(set(extracted_skills))

--- Get spaCy vector for each skill
skill_vectors = [nlp(skill).vector for skill in unique_skills]

--- Cluster with KMeans
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(skill_vectors)

--- Map each skill to its cluster
clusters = {}
for skill, label in zip(unique_skills, labels):
    clusters.setdefault(label, []).append(skill)
