# Data 400 Project - Charlene Vu & Phuong Anh Phi
# Decoding Tech Hiring Trends: A Keyword Analysis of LinkedIn Job Postings 🤓💼

## Overview

This project politely stalks (aka scrapes) LinkedIn job postings, cleans up messy job descriptions, and then uses topic modeling (LDA) to figure out what kinds of work, skills, and values are actually being described.

Instead of asking how many jobs exist or which skill appears the most, this project focuses on the language of job ads: what themes keep showing up, and how those themes change depending on job context like work setting, seniority, and location.

In short: we make the computer read a bunch of job postings so we don't have to.


## Why This Topic? 🤔📚

As a student applying to jobs, it's hard to tell what a role really means beyond the title. Two postings can look similar on the surface but describe totally different expectations, skills, and work styles.

Job descriptions aren't just skill checklists; they also reflect how companies define work, responsibility, culture, and impact. This project uses topic modeling to explore those hidden patterns and see how the language of work changes across contexts, like junior vs senior roles or remote vs on-site jobs.


## Research Question 🔍

Rather than asking "how many jobs are out there?", this project asks: what themes show up in LinkedIn job postings, and how do those themes differ by job context?

More specifically:

* What themes commonly appear in job descriptions?
* How do these themes differ by work setting (remote, hybrid, on-site)?
* How does job language change between junior and senior roles?
* How do ideas like culture, impact, and responsibility show up in job ads?

Using Latent Dirichlet Allocation (LDA), the goal is to uncover the hidden structure of how jobs are described.


## Project Structure 🗂️

Because chaos is not a strategy:

* `scraping_notebook.ipynb` – Scrapes LinkedIn job postings
* `Vu C-Phi A-DATA400 Final Project.ipynb` – Data cleaning, EDA, and topic modeling
* `dataset/jobs.csv` – Cleaned dataset used for analysis
* `slides/presentation.pdf` – Summary of methods and findings
* `README.md` – You're reading it 👀


## Data Collection 🕵️‍♀️

* Source: LinkedIn job postings
* How: Python scraping (LinkedIn was... not a fan)
* Data collected includes:

  * Job title
  * Job description
  * Company
  * Location
  * Job URL

### Limitations 😅

* Salary data was mostly missing and not used
* Scraping limits reduced the size of the dataset
* Job descriptions vary a lot in length and quality


## Data Processing 🧹

Turning chaos into slightly less chaos:

* Removed duplicate job postings
* Dropped postings with missing job descriptions
* Cleaned job text by:

  * Lowercasing
  * Removing punctuation, numbers, and special characters
  * Removing stopwords
* Filtered very short job descriptions (LDA does not like those)
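The cleaning steps above could look roughly like the sketch below. It is an illustration, not the notebook's actual code: the column names (`job_url`, `description`, `clean_description`), the `min_tokens` cutoff, and the choice of scikit-learn's built-in English stopword list are all assumptions.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def clean_text(text: str) -> str:
    """Lowercase, keep letters only, and drop English stopwords."""
    text = text.lower()
    # Replace punctuation, digits, and special characters with spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


def prepare_jobs(df: pd.DataFrame, min_tokens: int = 30) -> pd.DataFrame:
    """Deduplicate, drop missing descriptions, clean text, filter short postings.

    Column names and the min_tokens default are assumed for illustration.
    """
    df = df.drop_duplicates(subset=["job_url"])
    df = df.dropna(subset=["description"]).copy()
    df["clean_description"] = df["description"].map(clean_text)
    # Count tokens left after cleaning and keep only long-enough postings.
    long_enough = df["clean_description"].str.split().str.len() >= min_tokens
    return df[long_enough].reset_index(drop=True)
```

Deduplicating on the job URL is one reasonable choice; deduplicating on title plus description would also catch reposts under a new URL.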


## Feature Engineering 🧠✨

* Created a cleaned text column specifically for NLP
* Inferred job context features from text:

  * Work setting (remote / hybrid / on-site)
  * Seniority level (junior vs senior, based on keywords)
* Converted job descriptions into a document-term matrix (DTM) using `CountVectorizer`

Basically: turning words into numbers so LDA can do its thing.
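Here is a minimal sketch of both feature-engineering steps: keyword rules for the context labels, then `CountVectorizer` for the DTM. The keyword lists, the `"unknown"` fallback label, and the `min_df`/`max_df` values are hypothetical; this README does not list the ones the project actually used.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical keyword patterns, chosen only for illustration.
WORK_SETTING_PATTERNS = {
    "hybrid": r"\bhybrid\b",
    "remote": r"\b(remote|work from home|wfh)\b",
    "on-site": r"\b(on[- ]?site|in[- ]office|in person)\b",
}
SENIOR_PATTERN = r"\b(senior|sr|lead|principal|director)\b"
JUNIOR_PATTERN = r"\b(junior|jr|entry[- ]level|intern|internship|new grad)\b"


def infer_work_setting(text: str) -> str:
    """Label a posting as hybrid, remote, on-site, or unknown."""
    text = text.lower()
    # Dicts keep insertion order, so "hybrid" is checked first:
    # hybrid postings often mention remote days as well.
    for label, pattern in WORK_SETTING_PATTERNS.items():
        if re.search(pattern, text):
            return label
    return "unknown"


def infer_seniority(text: str) -> str:
    """Label a posting as senior, junior, or unknown from keywords."""
    text = text.lower()
    if re.search(SENIOR_PATTERN, text):
        return "senior"
    if re.search(JUNIOR_PATTERN, text):
        return "junior"
    return "unknown"


def build_dtm(docs, min_df=5, max_df=0.9):
    """Turn cleaned descriptions into a sparse document-term matrix."""
    # min_df / max_df trim very rare and very common terms (assumed values).
    vectorizer = CountVectorizer(min_df=min_df, max_df=max_df)
    dtm = vectorizer.fit_transform(docs)
    return vectorizer, dtm
```

Keyword rules like these are cheap but noisy (a "lead" in "lead generation" would count as senior), so spot-checking a sample of labels is worth the time.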


## Modeling & Analysis 📊

The main analysis uses **Latent Dirichlet Allocation (LDA)** to identify common themes in job descriptions.

Steps:

* Built DTMs from cleaned job descriptions
* Trained LDA models with different numbers of topics
* Selected a topic count that balanced interpretability and coherence
* Examined top keywords per topic to interpret themes
* Compared how topic prevalence differs across:

  * Work settings
  * Seniority levels


## Results & Insights 🔍

What the model revealed:

* Job postings consistently cluster around themes like:

  * Technical skills
  * Communication and collaboration
  * Responsibility and ownership
  * Company culture and impact
* Senior roles emphasize leadership, ownership, and strategy
* Junior roles focus more on learning, support, and execution
* Remote roles tend to highlight autonomy and communication more than on-site roles

Job titles don't tell the full story; the language does.


## Tools & Tech 🛠️

* Python
* pandas, numpy
* matplotlib
* scikit-learn
* Jupyter Notebook


## Limitations 🙃

* Dataset size is limited by scraping restrictions
* Topic interpretation is subjective (LDA gives themes, not labels)
* Results reflect a snapshot in time, not long-term trends


## Future Improvements 🚀

If we had more time (and fewer LinkedIn blocks):

* Collect more data across industries and regions
* Use dynamic topic modeling to track changes over time
* Compare LDA with embedding-based topic models
* Connect themes more directly to job outcomes (like seniority or pay)


## Ethical Stuff 🧑‍⚖️

* Only publicly available job postings were used
* No personal data was collected
* This project is for academic and educational purposes only (promise)
