# An In-depth Exploration of the US Job Market by Analyzing LinkedIn Job Postings Using Natural Language Processing Techniques

## Data ingestion and preprocessing
We first need to read the web-scraped dataset and explore its contents before beginning the analysis.
To understand one domain in depth, we will focus on jobs in the IT industry.

In [11]:
# importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import re
import json
from ollama import chat
from pydantic import BaseModel, Field
from typing import List

In [12]:
# read the dataset
data = pd.read_csv('/Users/juliabarsow/Desktop/thesis/project_code/postings.csv')

In [13]:
#check how many descriptions are missing as this is the most important column
print("Missing rows of description: ",data['description'].isnull().sum())

#drop rows with missing descriptions
data = data.dropna(subset=['description'])

Missing rows of description:  7


In [35]:
list(data['title'].unique())

['Marketing Coordinator',
 'Mental Health Therapist/Counselor',
 'Assitant Restaurant Manager',
 'Senior Elder Law / Trusts and Estates Associate Attorney',
 ' Service Technician',
 'Economic Development and Planning Intern',
 'Producer',
 'Building Engineer',
 'Respiratory Therapist',
 'Worship Leader',
 'Inside Customer Service Associate',
 'Project Architect',
 "Appalachian Highlands Women's Business Center",
 'Structural Engineer',
 'Senior Product Marketing Manager',
 'Osteogenic Loading Coach',
 'Administrative Coordinator',
 'Customer Service / Reservationist',
 'Content Writer, Communications',
 'Controller',
 'Physician Assistant',
 'Licensed Acupuncturist',
 'Software Engineer',
 'Sheet Metal Fabricator',
 'Personal Injury Attorney',
 'NPE 2024 Exhibition Event Worker',
 'Loan Coordinator',
 'General Laborer',
 'Swim Instructor',
 'Administrative Assistant',
 'Service / Construction Technician',
 'Legal Secretary',
 'Salesperson',
 'Registered Nurse',
 'Marketing & Office Coo

In [None]:
# List of titles to count
titles_to_count = [
    'Full Stack Engineer',
    'Intern- Business Analytics',
    'Enterprise Data & Analytics Infrastructure Manager',
    'Computer Scientist',
    'Front end specialist',
    'Project Engineer',
    'Data Architect',
    'Project Manager',
    'Data Analyst',
    'Java architect / Lead Java developer',
    'Senior Software Engineer',
    'Web Developer',
    'Software Implementation Program Manager',
    'Vice President - Engineering & Production Operation',
    'Test Engineer',
    'Sr Software Engineer',
    'IT QA Engineer II',
    'Sr. Business Analyst/Tester',
    'Senior Developer – React Native',
    'Cloud DevOps Engineer',
    'Senior Analyst, Data & Analytics',
    'Senior Business Analyst',
    'Engineering Project Manager / Project Manager',
    'Java full Stack Engineer',
    'backend Java developer',
    'Data Science Software Engineer',
    'Data Engineer/ETL'
]

# Count occurrences of each title
title_counts = data['title'].value_counts()

# Filter counts for the specified titles
filtered_counts = title_counts[title_counts.index.isin(titles_to_count)]

# Display the counts
print(filtered_counts)

title
Project Manager                                        354
Senior Software Engineer                               162
Project Engineer                                       102
Full Stack Engineer                                     57
Web Developer                                           43
Senior Business Analyst                                 34
Data Architect                                          27
Test Engineer                                           24
Sr Software Engineer                                     6
Cloud DevOps Engineer                                    3
Senior Developer – React Native                          3
Software Implementation Program Manager                  2
backend Java developer                                   1
Data Science Software Engineer                           1
Data Engineer/ETL                                        1
Computer Scientist                                       1
Front end specialist                              
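
Exact title matching treats close variants as different jobs (the counts above list 'Senior Software Engineer' and 'Sr Software Engineer' separately). A keyword-based mask is one possible alternative. The sketch below is only an illustration: the keyword list `IT_KEYWORDS` and the helper `is_it_title` are hypothetical and not part of the original notebook, and the keywords would need tuning against the real titles.

```python
import pandas as pd

# Hypothetical keyword list for illustration; plain words only, no regex metacharacters.
IT_KEYWORDS = ["software", "developer", "data", "devops", "full stack", "qa engineer"]

# Non-capturing group with word boundaries, so each keyword matches whole words only.
IT_PATTERN = r"\b(?:" + "|".join(IT_KEYWORDS) + r")\b"


def is_it_title(titles: pd.Series) -> pd.Series:
    """Return a boolean mask marking titles that contain any IT keyword."""
    return titles.fillna("").str.contains(IT_PATTERN, case=False, regex=True)


sample = pd.DataFrame({"title": ["Senior Software Engineer", " Sr Software Engineer",
                                 "Registered Nurse", "backend Java developer"]})
it_sample = sample[is_it_title(sample["title"])]
print(it_sample["title"].str.strip().tolist())
```

On the real data the same mask could be applied as `data[is_it_title(data['title'])]`. Generic titles such as 'Project Manager' are deliberately not covered by these keywords, since they also occur outside IT.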

In [20]:
# show one description for a specific title
data[data['title'] == 'Project Manager']['description'].values[2]

'Xact Warehouse Solutions is looking for a "rock star" Project Manager in the material handling industry with a go-getter attitude to join our awesome team! Xact is a small business and is looking for a "Go-Getter" with a passion for solving problems for customers and serving those customers with excellence.\nJust a few tasks the Project Manager will do:Proactively manage projects effectively from start to finish.Create proposals, purchase requisitions and all other project tasks.Procure materials and labor contractors.Ensure projects are on-schedule and within budget.Communicate with vendors and customers regarding schedules, deliveries and any changes or updates to the project schedule.Attend and participate in meetings, conferences, training and company events.Manage and communicate changes clearly to customer and project team.\nAbout the Project Manager:Education: Bachelor\'s degree or equivalent experience.Software: Proficient in Microsoft Office applications to include Microsoft 

In [12]:
# ✅ Calculate missing values, available count, and percentage
missing_values = data.isnull().sum().to_frame(name='missing_count')
missing_values['available_count'] = len(data) - missing_values['missing_count']
missing_values['missing_percentage'] = (missing_values['missing_count'] / len(data)) * 100  # Keep as float

# ✅ Reorder columns for better readability (available_count first)
missing_values = missing_values[['available_count', 'missing_count', 'missing_percentage']]

# ✅ Sort by missing percentage (ascending for better features on top)
missing_values = missing_values.sort_values(by='missing_percentage', ascending=True)

# ✅ Reset index and rename it to "column_name"
missing_values = missing_values.reset_index().rename(columns={'index': 'column_name'})

# ✅ Apply clear and meaningful color formatting
styled_missing_values = (
    missing_values.style
    .background_gradient(subset=['available_count'], cmap='Greens')  # More available → Green (good)
    .background_gradient(subset=['missing_count'], cmap='Oranges')  # More missing → Darker Orange (bad)
    .background_gradient(subset=['missing_percentage'], cmap='Reds')  # Higher missing % → Darker Red (bad)
    .format({'missing_percentage': "{:.2f}%"})  # Format as percentage AFTER styling
)

# ✅ Display dataset size
print(f"The dataset size: {data.shape[0]} rows")

# ✅ Display missing values table with improved color usage
display(styled_missing_values)

The dataset size: 123842 rows


Unnamed: 0,column_name,available_count,missing_count,missing_percentage
0,job_id,123842,0,0.00%
1,work_type,123842,0,0.00%
2,sponsored,123842,0,0.00%
3,listed_time,123842,0,0.00%
4,expiry,123842,0,0.00%
5,application_type,123842,0,0.00%
6,original_listed_time,123842,0,0.00%
7,formatted_work_type,123842,0,0.00%
8,job_posting_url,123842,0,0.00%
9,title,123842,0,0.00%


As we can see, some columns have a significant amount of missing data. Rather than dropping them up front, we will drop columns separately for each use case.
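
Handling missing values per use case can be sketched as follows: each analysis keeps only the columns it needs and drops rows that lack a value in any of them. The helper `subset_for_use_case` and the `max_salary` column in the toy frame are hypothetical and used only for illustration.

```python
import pandas as pd


def subset_for_use_case(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Keep only the columns a use case needs and drop rows missing any of them."""
    return df[columns].dropna(subset=columns).reset_index(drop=True)


toy = pd.DataFrame({
    "title": ["Data Analyst", "Web Developer", "Test Engineer"],
    "description": ["SQL and Excel", "React and CSS", "Selenium"],
    "max_salary": [90000.0, None, None],  # hypothetical, sparsely filled column
})

skills_view = subset_for_use_case(toy, ["title", "description"])  # all rows survive
salary_view = subset_for_use_case(toy, ["title", "max_salary"])   # only rows with a salary
print(len(skills_view), len(salary_view))
```

This way a sparsely filled column only shrinks the analyses that actually depend on it.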

## Skill Extraction and Clustering

### Objective: Extract required skills from job descriptions and cluster them to identify common skill sets across industries.

NLP Techniques: Named Entity Recognition (NER), Topic Modeling, or Clustering algorithms.

Research Questions:
○ What are the most in-demand skills across different sectors?
○ How do skill requirements differ by salary range or job title?
○ What are the salary ranges in different sectors and job positions?


## We will use Ollama: https://ollama.com/ for the skill extraction

In [2]:
!pip install ollama



In [None]:
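# Run these in a terminal, not in the notebook.
# `python3 -m ollama ...` fails (see the error below) because the `ollama` Python
# package is a client library, not the command-line tool. With the Ollama CLI installed, use:
#   ollama pull llama3.2
#   ollama run granite3.2:8b
!python3 -m ollama pull llama3.2
!python3 -m ollama run granite3.2:8b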

/Users/juliabarsow/Desktop/thesis/project_code/NLP_Job_Postings/.venv/bin/python3: No module named ollama.__main__; 'ollama' is a package and cannot be directly executed


In [None]:
from langchain_community.llms import Ollama

# Initialize Ollama with your chosen model
llm = Ollama(model="llama3.2")

# Invoke the model with a query
response = llm.invoke("What is LLM?")
print(response)

LLM stands for Large Language Model. It's a type of artificial intelligence (AI) model that uses natural language processing (NLP) to understand and generate human-like text.

Large Language Models are trained on vast amounts of text data, which enables them to learn patterns, relationships, and context in language. This training allows the model to:

1. Understand the nuances of language, such as syntax, semantics, and pragmatics.
2. Generate coherent and contextually relevant text, including sentences, paragraphs, and even entire articles or stories.

LLMs are typically trained on massive datasets of text from various sources, including books, articles, websites, and social media platforms. This training process enables the model to learn common language patterns, idioms, and expressions, as well as its own biases and limitations.

Some popular applications of Large Language Models include:

1. Language translation: LLMs can be used to translate text from one language to another with

In [None]:


# 1. Define the desired output structure using Pydantic
# This creates a JSON Schema that Ollama is forced to follow.
class JobRequirements(BaseModel):
    """A structured model to hold the extracted skills and requirements from a job posting."""
    skills: List[str] = Field(
        ..., 
        description="A list of specific, technical, or soft skills required. E.g., 'Python', 'Machine Learning', 'Problem-Solving'."
    )
    requirements: List[str] = Field(
        ..., 
        description="A list of formal requirements, like years of experience, educational degrees, or specific certifications. E.g., 'Bachelor's Degree in Computer Science', '5+ years of experience with Kafka', 'AWS Certified'."
    )

# 2. Define the job posting text you want to analyze
job_posting_text = """
Job Title: Senior Data Scientist

The ideal candidate will have 5+ years of professional experience
working with large datasets and implementing Machine Learning models.
Must have a Master's Degree in a quantitative field.
Required technical skills include expert-level proficiency in Python, 
experience with PyTorch for deep learning, and practical knowledge of 
cloud platforms, specifically AWS. We are looking for strong problem-solving 
abilities and excellent communication skills. Experience with Kafka 
is a plus. Do not apply if you do not meet the experience requirements.
"""

# 3. Define the prompt and configuration for Ollama
# We use the chat API for better control and adherence to instructions.
def extract_job_data(job_post: str, model_name: str = "llama3.2") -> JobRequirements:
    """
    Sends the job posting to an Ollama model and extracts structured data.
    """
    
    # Generate the JSON schema from the Pydantic model
    schema = JobRequirements.model_json_schema()

    # The system prompt guides the model's behavior
    system_prompt = (
        "You are an expert HR data extraction bot. Your task is to accurately "
        "extract the required skills and formal requirements from the provided job posting. "
        "Do not include any commentary or additional text. "
        "The output MUST conform strictly to the provided JSON schema."
    )
    
    # The user prompt contains the data to be analyzed
    user_prompt = f"Analyze the following job posting and return the extracted data:\n\n---\n{job_post}"
    
    print(f"--- Sending request to Ollama with model: {model_name} ---")

    try:
        response = chat(
            model=model_name,
            messages=[
                {'role': 'system', 'content': system_prompt},
                {'role': 'user', 'content': user_prompt}
            ],
            # This is the key setting for structured output!
            format=schema,
            # Use temperature 0 for deterministic, reliable extraction
            options={'temperature': 0}
        )
        
        # The model's content is a JSON string conforming to the schema
        json_string = response['message']['content']
        
        # Validate and convert the JSON string back into a Pydantic object
        extracted_data = JobRequirements.model_validate_json(json_string)
        
        return extracted_data

    except Exception as e:
        print(f"An error occurred: {e}")
        print("Ensure Ollama server is running and the model is pulled.")
        return None

# 4. Execute the extraction and display the results
if __name__ == "__main__":
    model_name = "llama3.2"
    result = extract_job_data(job_posting_text, model_name=model_name)

    if result:
        print("\n--- Extracted Data (Structured Output) ---")
        # Report the model that was actually used instead of a hard-coded name
        print(f"Model Used: {model_name}")
        print("-" * 35)
        print("SKILLS: \n  - " + "\n  - ".join(result.skills))
        print("\nFORMAL REQUIREMENTS: \n  - " + "\n  - ".join(result.requirements))
        
        # You can also access the raw data as a dictionary
        # print("\nRaw Dictionary Output:", result.model_dump())

--- Sending request to Ollama with model: llama3.2 ---

--- Extracted Data (Structured Output) ---
Model Used: llama3.2
-----------------------------------
SKILLS: 
  - Python
  - PyTorch
  - AWS

FORMAL REQUIREMENTS: 
  - Master's Degree in quantitative field
  - 5+ years of professional experience working with large datasets and implementing Machine Learning models
  - Strong problem-solving abilities
  - Excellent communication skills
  - Experience with Kafka (optional)
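
To move from one posting to many, the extraction call can be wrapped in a loop that tolerates failed requests. The sketch below is an assumption about how this could look, not part of the original notebook: `extract_batch` is a hypothetical helper that accepts any callable returning an object with a `.skills` list (or `None` on failure, as `extract_job_data` does). A stub extractor stands in for the model so the example runs without an Ollama server.

```python
from typing import Callable, Iterable, Optional


def extract_batch(descriptions: Iterable[str],
                  extractor: Callable[[str], Optional[object]],
                  limit: Optional[int] = None) -> list[dict]:
    """Apply `extractor` to each description and collect the skills as plain dicts."""
    rows = []
    for position, text in enumerate(descriptions):
        if limit is not None and position >= limit:
            break
        result = extractor(text)
        # A failed extraction returns None; record an empty list instead of crashing.
        skills = list(result.skills) if result is not None else []
        rows.append({"position": position, "skills": skills})
    return rows


class StubResult:
    """Stand-in for the structured result; only the `skills` attribute is needed."""
    def __init__(self, skills):
        self.skills = skills


def stub_extractor(text: str) -> Optional[StubResult]:
    """Toy extractor: substring lookup against a tiny list, None for empty input."""
    if not text.strip():
        return None
    known = ["Python", "SQL", "AWS"]
    return StubResult([k for k in known if k.lower() in text.lower()])


rows = extract_batch(["Python and SQL required", "", "AWS experience"], stub_extractor)
print(rows)
```

With the real model, a call such as `extract_batch(data['description'], extract_job_data, limit=100)` would keep a first run short, since every posting costs one model request.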


### NEXT STEPS

1. Skill Extraction - NER model for skills
Use a model trained to recognize skills (e.g., SpaCy custom NER, or libraries like SkillNer, Pyresparser, or ESCO-based tools).

2. Associate Skills with Job Titles

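example (assumes `nlp` is a loaded spaCy pipeline, `skill_set` is a set of lowercase single-word skill lemmas, and `job_titles` / `job_descriptions` are parallel lists):
skills_by_job = {}

for title, description in zip(job_titles, job_descriptions):
    doc = nlp(description.lower())
    tokens = [token.lemma_ for token in doc if token.lemma_ in skill_set]
    # extend instead of assign, so postings that share a title do not overwrite each other
    skills_by_job.setdefault(title, []).extend(tokens)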
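
Once every posting has a list of skills, the question of which skills are most in demand reduces to counting. A minimal sketch, assuming a hypothetical helper `top_skills` and a list of per-posting skill lists as input:

```python
from collections import Counter


def top_skills(skills_per_posting, n=3):
    """Count in how many postings each skill appears, ignoring case and padding."""
    counts = Counter()
    for skills in skills_per_posting:
        # A set makes a posting count each skill once, even if it is listed twice.
        counts.update({s.strip().lower() for s in skills})
    return counts.most_common(n)


example = [["Python", "SQL"], ["python", "AWS"], ["SQL", "Python ", "Python"]]
print(top_skills(example, n=2))  # → [('python', 3), ('sql', 2)]
```

Counting postings rather than raw mentions keeps one verbose job description from dominating the ranking.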


3. Clustering Skills into Categories - Automated clustering with embeddings (advanced)
Use word embeddings (like word2vec, spaCy, or sentence-transformers) + KMeans to group similar skills.

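example (assumes `nlp` is a spaCy pipeline with word vectors, e.g. en_core_web_md, and `extracted_skills` is a flat list of skill strings):

from sklearn.cluster import KMeans
import numpy as np

# Get unique skills (sorted so the order is reproducible)
unique_skills = sorted(set(extracted_skills))

# Get spaCy vector for each skill, stacked into one 2-D array
skill_vectors = np.vstack([nlp(skill).vector for skill in unique_skills])

# Cluster with KMeans (n_clusters=5 is a starting guess, not a tuned value)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = kmeans.fit_predict(skill_vectors)

# Map each skill to its cluster
clusters = {}
for skill, label in zip(unique_skills, labels):
    clusters.setdefault(int(label), []).append(skill)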
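
If no spaCy model with word vectors is available, the same clustering step can be tried with a different, simpler representation. The sketch below swaps the embeddings for TF-IDF over character n-grams, which groups skills by surface form only (shared substrings), not by meaning. It is a stand-in for experimentation rather than a replacement for embeddings. `cluster_skills` is a hypothetical helper, and `n_clusters=3` is an arbitrary choice for the toy input.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_skills(skills, n_clusters=3, seed=42):
    """Group skill strings by character n-gram similarity (surface form only)."""
    unique_skills = sorted({s.strip().lower() for s in skills})
    # char_wb n-grams let "machine learning" and "deep learning" share features.
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    vectors = vectorizer.fit_transform(unique_skills)
    # KMeans needs at least as many samples as clusters.
    k = min(n_clusters, len(unique_skills))
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(vectors)
    clusters = {}
    for skill, label in zip(unique_skills, labels):
        clusters.setdefault(int(label), []).append(skill)
    return clusters


skills = ["Python", "PyTorch", "Machine Learning", "Deep Learning", "AWS", "python "]
clusters = cluster_skills(skills)
for label, members in sorted(clusters.items()):
    print(label, members)
```

The cluster numbers themselves carry no meaning; only the grouping does, and it should be inspected by hand before any category names are assigned.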
