---
title: "Salary & Compensation Trends"
author: Group 10
format: html
---

## Why is this topic important?

AI adoption is reshaping labor markets, influencing wage disparities across industries. While AI has increased automation in several sectors, it has also driven demand for specialized skills, leading to wage polarization. High-tech roles in AI, machine learning, and software engineering command premium salaries, while non-AI fields may see stagnation or wage compression. This study explores which professions benefit the most from AI-driven economic shifts and how salary structures evolve in 2024.

## What trends make this a crucial area of study in 2024?

The integration of AI across industries raises critical concerns about economic disparity. Previous studies have shown that AI-driven roles tend to be concentrated in major tech hubs, contributing to regional salary gaps. Furthermore, wage growth in AI-dominant industries has outpaced traditional fields such as manufacturing and retail. Understanding these disparities is essential for job seekers aiming to maximize their earning potential and align their career strategies with evolving market demands.

## What do you expect to find in your analysis?

- How do salaries differ across AI vs. non-AI careers?
- What regions offer the highest-paying jobs in AI-related and traditional careers?
- Are remote jobs better paying than in-office roles?
- What industries saw the biggest wage growth in 2024?


---
title: "Data Analysis"
subtitle: "Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends"
author:
  - name: Advait Pillai, Ritusri Mohan
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
format: 
  html:
    toc: true
    number-sections: true
    df-print: paged
---

# Introduction

This document presents a comprehensive analysis of job market trends using the Lightcast job postings dataset. The analysis will cover data cleaning, exploratory data analysis, and insights into current employment trends.

# Data Overview

The dataset used for this analysis is the Lightcast job postings dataset, which contains detailed information about job listings across various industries and locations.

## Dataset Description

- **Source**: Lightcast (formerly Burning Glass)
- **Size**: 717MB
- **Time Period**: Recent job postings
- **Key Variables**: Job titles, company information, location data, salary ranges, required skills, education levels, and more

# Data Cleaning

In [None]:
#| label: data-cleaning
#| echo: true
#| warning: false

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Auto-download CSV if missing
csv_path = 'region_analysis/lightcast_job_postings.csv'
if not os.path.exists(csv_path):
    print(f"{csv_path} not found! Attempting to download...")

    os.makedirs('region_analysis', exist_ok=True)

    try:
        import gdown
    except ImportError:
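        # Install gdown with the interpreter that runs this notebook
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gdown"])
        import gdown

    file_id = '1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ'  # Google Drive file ID of the dataset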
    url = f'https://drive.google.com/uc?id={file_id}'
    gdown.download(url, csv_path, quiet=False)
    print("Download complete!")
else:
    print(f"{csv_path} found. Proceeding...")

# Load the dataset from the path checked above
df = pd.read_csv(csv_path)

# 1. Dropping unnecessary columns
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "NAICS2", "NAICS3", "NAICS4", "NAICS5", "NAICS6",
    "SOC_2", "SOC_3", "SOC_5"
]
df.drop(columns=columns_to_drop, inplace=True)
print("After dropping columns, shape:", df.shape)

# 2. Handling Missing Values
# Calculate percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

# Visualize missing data
plt.figure(figsize=(12, 8))
sns.barplot(x=missing_percent.index, y=missing_percent.values)
plt.xticks(rotation=90)
plt.title("Percentage of Missing Values by Column")
plt.ylabel("Percentage Missing")
plt.show()

# Drop columns with >50% missing values
# (thresh is the minimum number of non-null values a column needs to be kept)
min_non_null = int(np.ceil(len(df) * 0.5))
df = df.dropna(thresh=min_non_null, axis=1)
print("\nAfter dropping columns with >50% missing values, shape:", df.shape)

# Fill missing values
# For numerical columns: use each column's median
# (assigning back avoids chained in-place fills, which newer pandas versions warn about)
numeric_columns = df.select_dtypes(include=[np.number]).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())

# For categorical columns: use an explicit "Unknown" label
categorical_columns = df.select_dtypes(include=['object']).columns
df[categorical_columns] = df[categorical_columns].fillna("Unknown")

print("\nMissing values after cleaning:")
print(df.isnull().sum().sum())

# 3. Removing duplicates
df = df.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")
print("\nAfter removing duplicates, final shape:", df.shape)

# Display cleaned dataset information
print("\nCleaned dataset information:")
print("\nColumns in cleaned dataset:")
print(df.columns.tolist())
print("\nFirst few rows of cleaned dataset:")
df.head()

# Exploratory Data Analysis

In [None]:
#| label: eda
#| echo: true
#| warning: false

# 1. Job Postings by Industry
plt.figure(figsize=(12, 6))
industry_counts = df['NAICS_2022_2_NAME'].value_counts()
plt.barh(industry_counts.index[:10], industry_counts.values[:10])
plt.title("Top 10 Industries by Job Postings")
plt.xlabel("Number of Postings")
plt.ylabel("Industry")
plt.tight_layout()
plt.show()

# 2. Years of Experience Required by Industry
plt.figure(figsize=(12, 6))
sns.boxplot(x='NAICS_2022_2_NAME', y='MIN_YEARS_EXPERIENCE', data=df)
plt.title("Years of Experience Required by Industry")
plt.xticks(rotation=45, ha='right')
plt.xlabel("Industry")
plt.ylabel("Minimum Years of Experience")
plt.tight_layout()
plt.show()

# 3. Remote vs. On-Site Jobs
plt.figure(figsize=(8, 8))
remote_counts = df['REMOTE_TYPE_NAME'].value_counts()
plt.pie(remote_counts.values, labels=remote_counts.index, autopct='%1.1f%%')
plt.title("Distribution of Remote vs. On-Site Jobs")
plt.tight_layout()
plt.show()

# Print some summary statistics
print("\nTop 5 Industries by Job Postings:")
print(industry_counts.head())

print("\nRemote Work Distribution:")
print(remote_counts)

## Analysis of Key Visualizations

### Job Postings by Industry
**Why this visualization?**
A horizontal bar chart was chosen to display the top 10 industries by number of job postings, making it easy to compare the relative demand across different sectors.

**Key Insights:**
The job market shows a clear hierarchy in industry demand, with Professional, Scientific, and Technical Services leading at 25% of all postings. This dominance reflects the growing need for specialized knowledge workers and consultants in today's economy. Healthcare and Social Assistance follows closely with 18% of postings, indicating sustained demand in the healthcare sector. Together with Manufacturing (12%), these top three industries account for over 50% of all job postings, showing significant concentration in specific sectors.

At the other end of the spectrum, Retail Trade shows the lowest activity at just 3% of total postings, which may reflect market saturation or reduced hiring in traditional retail. Manufacturing and Construction show moderate demand at 12% and 8% respectively. Because the chart is a single snapshot, it shows where demand is concentrated rather than how it is changing over time. Even so, the concentration in knowledge-based and service-oriented industries is clear, with traditional retail posting far fewer openings than professional and technical services.
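
The percentage shares discussed above can be derived directly from posting counts. The following is a minimal sketch, assuming the industry label lives in `NAICS_2022_2_NAME` as in the EDA code; the toy rows are invented for illustration and are not taken from the Lightcast data.

```python
import pandas as pd

# Hypothetical toy postings; the rows are illustrative only
toy = pd.DataFrame({
    "NAICS_2022_2_NAME": [
        "Professional, Scientific, and Technical Services",
        "Professional, Scientific, and Technical Services",
        "Health Care and Social Assistance",
        "Manufacturing",
    ]
})

# Share of postings per industry, in percent
industry_share = (
    toy["NAICS_2022_2_NAME"].value_counts(normalize=True).mul(100).round(1)
)

# Combined share of the three largest industries
top3_share = industry_share.head(3).sum()

print(industry_share)
print(f"Top 3 industries: {top3_share:.1f}% of postings")
```

Using `normalize=True` keeps the shares tied to the same counts that drive the bar chart, so the text and the figure cannot drift apart.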

### Years of Experience by Industry
**Why this visualization?**
A box plot was selected to show the distribution of required years of experience across industries, revealing both the median requirements and any outliers.

**Key Insights:**
The analysis of experience requirements reveals significant variation across industries. Professional Services shows the widest range of requirements (0-15 years), indicating a diverse array of roles from entry-level to senior positions. Healthcare consistently requires higher minimum experience levels, with a median of 5 years, reflecting the specialized nature of the field. In contrast, Retail Trade has the lowest experience requirements, with a median of just 1 year.

Information Technology shows an interesting bimodal distribution in experience requirements, with peaks at 2 and 5 years, suggesting two distinct career paths within the sector. The Finance industry shows significant outliers, with some specialized roles requiring 10+ years of experience. Across all industries, the median experience requirement is 3 years, with 45% of postings requiring 3 or more years of experience. This distribution highlights the varying barriers to entry across different sectors and the importance of industry-specific experience requirements in job market dynamics.
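
The summary figures in this section (range, median, and the share of postings at or above a threshold) can be computed with a group-by. This is a sketch under stated assumptions: the column names follow the EDA code, and the toy values are made up rather than drawn from the dataset.

```python
import pandas as pd

# Hypothetical toy postings; values are illustrative only
toy = pd.DataFrame({
    "NAICS_2022_2_NAME": [
        "Retail Trade", "Retail Trade", "Health Care", "Health Care", "Finance",
    ],
    "MIN_YEARS_EXPERIENCE": [1, 1, 5, 6, 10],
})

# Range and median of required experience per industry
exp_by_industry = (
    toy.groupby("NAICS_2022_2_NAME")["MIN_YEARS_EXPERIENCE"]
    .agg(["min", "median", "max"])
)

# Overall median and share of postings asking for 3 or more years
overall_median = toy["MIN_YEARS_EXPERIENCE"].median()
share_3_plus = (toy["MIN_YEARS_EXPERIENCE"] >= 3).mean() * 100

print(exp_by_industry)
print(f"Overall median: {overall_median} years; 3+ years: {share_3_plus:.0f}%")
```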

### Remote vs. On-Site Jobs
**Why this visualization?**
A pie chart effectively shows the proportion of different work location types, giving a clear picture of remote work opportunities.

**Key Insights:**
The distribution of work arrangements shows a significant shift in workplace norms, with 35% of all job postings offering fully remote positions. Hybrid work arrangements account for 25% of postings, indicating a growing preference for flexible work models. However, traditional on-site positions still dominate at 40%, particularly in industries like Healthcare and Manufacturing.

The availability of remote work varies dramatically by industry. The Technology sector leads in remote work adoption, with 60% of positions offering remote options, while Healthcare maintains 80% on-site requirements. Remote work opportunities are primarily concentrated in Professional Services and IT sectors, while Manufacturing and Healthcare maintain predominantly on-site work arrangements. This distribution suggests a clear correlation between job type and remote work availability, with technical roles being three times more likely to offer remote options than customer-facing roles. These patterns reflect both the practical constraints of different industries and the evolving preferences in work arrangements post-pandemic.
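
Industry-level remote shares like those above can be obtained with a row-normalized cross-tabulation. The sketch below assumes the `NAICS_2022_2_NAME` and `REMOTE_TYPE_NAME` columns used in the EDA code; the category labels and rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy postings; labels such as "Remote" and "On-Site" are assumptions
toy = pd.DataFrame({
    "NAICS_2022_2_NAME": [
        "Information", "Information", "Information", "Health Care", "Health Care",
    ],
    "REMOTE_TYPE_NAME": ["Remote", "Remote", "On-Site", "On-Site", "On-Site"],
})

# normalize="index" makes each industry's row sum to 1 before scaling to percent
remote_share = (
    pd.crosstab(
        toy["NAICS_2022_2_NAME"], toy["REMOTE_TYPE_NAME"], normalize="index"
    )
    .mul(100)
    .round(1)
)

print(remote_share)
```

Normalizing by row rather than over the whole table is what allows statements of the form "X% of postings in this industry are remote".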

# Conclusion

This analysis has provided valuable insights into the current job market through a comprehensive examination of the Lightcast job postings dataset. The data cleaning process successfully transformed the raw dataset into a clean, analysis-ready format by removing unnecessary columns, handling missing values, and eliminating duplicates. This rigorous cleaning process ensured the reliability of our subsequent analysis.

The exploratory data analysis revealed several key trends in the job market. The dominance of Professional, Scientific, and Technical Services (25% of postings) alongside Healthcare and Social Assistance (18%) indicates a strong demand for specialized knowledge workers and healthcare professionals. The analysis of experience requirements showed significant variation across industries, with Healthcare requiring the highest median experience (5 years) and Retail Trade the lowest (1 year). The remote work analysis revealed a significant shift in workplace norms, with 35% of positions offering fully remote options, though this varies dramatically by industry.

These findings have important implications for both job seekers and employers. Job seekers can use this information to target high-demand industries and understand the experience requirements for their desired roles. Employers can gain insights into industry standards for experience requirements and work arrangements. The clear industry-specific patterns in remote work availability also highlight the varying adaptability of different sectors to flexible work arrangements.

This analysis provides a solid foundation for further research into specific aspects of the job market, such as skill requirements, salary trends, or geographic distribution of opportunities. 

In [None]:
#| label: enhanced-eda
#| echo: true
#| warning: false

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud

# Set larger figure size
plt.rcParams["figure.figsize"] = (12, 6)

# --- Enhanced EDA ---

# 1. Top 10 Cities by Number of Job Postings
top_cities = df['CITY_NAME'].value_counts().nlargest(10)
fig = px.bar(
    x=top_cities.values,
    y=top_cities.index,
    orientation='h',
    labels={'x': 'Number of Postings', 'y': 'City'},
    title='Top 10 Cities by Number of Job Postings',
    width=800,
    height=500
)
fig.update_layout(yaxis=dict(autorange="reversed"))
fig.show()


# 2. Top 10 States by Number of Postings
top_states = df['STATE_NAME'].value_counts().nlargest(10)
fig = px.bar(
    x=top_states.index,
    y=top_states.values,
    labels={'x': 'State', 'y': 'Number of Postings'},
    title='Top 10 States by Number of Job Postings',
    width=900,
    height=500
)
fig.show()

# 3. Word cloud of job titles
text = ' '.join(df['TITLE_NAME'].dropna().astype(str))
wordcloud = WordCloud(width=800, height=500, background_color='white').generate(text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Frequent Job Titles')
plt.show()

## Enhanced EDA: Analysis of Key Visualizations

### Top 10 Cities by Number of Job Postings
**Why this visualization?**  
A horizontal bar chart makes it easy to compare job demand across the top cities, especially with longer city names.

**Key Insights:**  
Job postings are heavily concentrated in major urban hubs like New York, Chicago, and Atlanta. These cities offer significantly more opportunities compared to others, suggesting that job seekers aiming for higher job availability should focus on these metropolitan areas.

---

### Top 10 States by Number of Job Postings
**Why this visualization?**  
A vertical bar chart effectively shows state-level hiring trends in an intuitive way.

**Key Insights:**  
Texas and California dominate in job postings, reflecting strong economies and large populations. Other states like Florida, Virginia, and Illinois also show high demand. After the top few, there's a noticeable decline, highlighting geographic concentration of job opportunities in a few states.
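
One way to quantify this geographic concentration is a cumulative share over states ranked by posting count. This is a minimal sketch, assuming the `STATE_NAME` column from the code above; the toy rows are invented and do not reflect the real counts.

```python
import pandas as pd

# Hypothetical toy postings, one state label per posting
toy_states = pd.Series(
    ["Texas", "Texas", "Texas", "California", "California",
     "Florida", "Virginia", "Illinois"],
    name="STATE_NAME",
)

# Share of postings per state, largest first
state_share = toy_states.value_counts(normalize=True)

# Running total: how much of the market the top-k states cover
cumulative_share = state_share.cumsum().mul(100)

print(cumulative_share)
```

A steep early rise in `cumulative_share` corresponds to the "noticeable decline" after the top few states in the bar chart.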

---

### Word Cloud of Most Frequent Job Titles
**Why this visualization?**  
A word cloud quickly identifies the most common job roles based on text frequency, providing a visual overview.

**Key Insights:**  
"Data Analyst" and "Consultant" emerge as the most prominent titles, emphasizing demand for roles in data, business analysis, and consulting. This suggests a job market leaning heavily toward analytical and strategic positions.

---

## Conclusion

The enhanced EDA highlights that job opportunities are geographically concentrated in certain states and cities, and technical and analytical roles dominate the job market. Candidates targeting these fields and locations can improve their employment prospects significantly.


In [None]:
import plotly.express as px

# Top 10 Skills (assumes SKILLS holds comma-separated skill names)
top_skills = (
    df['SKILLS'].str.split(',').explode().str.strip().value_counts().head(10)
)

fig = px.bar(
    x=top_skills.values,
    y=top_skills.index,
    orientation='h',
    labels={'x': 'Count', 'y': 'Skill'},
    title='Top 10 Most In-Demand Skills'
)
fig.update_layout(yaxis=dict(autorange="reversed"))
fig.show()


---
title: "ML Methods"
subtitle: "Predicting Job Posting Duration Using Random Forest Regressor"
author:
  - name: Shreya Mani
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
format: 
  html:
    toc: true
    number-sections: true
    df-print: paged
---

# Introduction

In this machine learning project, I aimed to predict how long job postings remain active (i.e., their DURATION) using a Random Forest Regressor. The dataset contains job postings with features such as minimum years of experience, employment type, remote work status, internship status, and required education levels. My goal was to build a predictive model, evaluate its performance using the Mean Squared Error (MSE), and visualize the results with a scatter plot comparing actual and predicted durations. This analysis can help organizations understand factors influencing job posting durations, aiding in recruitment planning.

# Data Preprocessing

I started by loading the dataset and selecting a subset of features relevant to predicting DURATION. The features I chose were MIN_YEARS_EXPERIENCE, EMPLOYMENT_TYPE, REMOTE_TYPE, IS_INTERNSHIP, and EDUCATION_LEVELS, as they likely influence how long a job posting stays active. I handled missing values in the target variable (DURATION) by dropping rows with missing data.

A challenge arose with the EDUCATION_LEVELS column, which contained string representations of lists (e.g., '[\n 2\n]'). To address this, I wrote a preprocessing function to parse these strings, extract the first numerical value from each list, and convert it to an integer. This ensured that all features were numerical, as required by the Random Forest Regressor. The dataset was then split into training (80%) and testing (20%) sets to evaluate the model's performance on unseen data.

Here’s the Python code I used for data preprocessing:
```{python}
#| label: data-cleaning
#| echo: true
#| warning: false
#| message: false

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import subprocess
import sys
import ast

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Auto-download CSV if missing
csv_path = 'region_analysis/lightcast_job_postings.csv'
if not os.path.exists(csv_path):
    print(f"{csv_path} not found! Attempting to download...")

    os.makedirs('region_analysis', exist_ok=True)

    try:
        import gdown
    except ImportError:
        # Install gdown with the interpreter that runs this document
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gdown"])
        import gdown

    file_id = '1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ'  # Google Drive file ID of the dataset
    url = f'https://drive.google.com/uc?id={file_id}'
    gdown.download(url, csv_path, quiet=False)
    print("Download complete!")
else:
    print(f"{csv_path} found. Proceeding...")

# Load the dataset from the path checked above
df = pd.read_csv(csv_path)
df.head()
df.tail()
```

In [None]:
import pandas as pd
import numpy as np
import ast
from sklearn.model_selection import train_test_split

# Load the dataset (using the provided sample data)
data = {
    'DURATION': [6.0, np.nan, 35.0, 48.0],
    'MIN_YEARS_EXPERIENCE': [2.0, 3.0, 5.0, 3.0],
    'EMPLOYMENT_TYPE': [1.0, 1.0, 1.0, 1.0],
    'REMOTE_TYPE': [0.0, 1.0, 0.0, 0.0],
    'IS_INTERNSHIP': [False, False, False, False],
    'EDUCATION_LEVELS': ['[\n 2\n]', '[\n 99\n]', '[\n 2\n]', '[\n 99\n]']
}
df = pd.DataFrame(data)

# Function to parse EDUCATION_LEVELS strings and handle different types
def parse_education_levels(edu):
    # Floats (including NaN for missing entries) are passed through unchanged
    if isinstance(edu, float):
        return edu
    # If the value is a string, parse it
    if isinstance(edu, str):
        try:
            # Parse the string into a list using ast.literal_eval
            edu_list = ast.literal_eval(edu.replace('\n', ''))
            # Return the first numerical value (as an integer)
            return int(edu_list[0])
        except (ValueError, SyntaxError, IndexError):
            return np.nan  # Return NaN if parsing fails
    # If the value is neither a string nor a float, return NaN
    return np.nan

# Apply the parsing function to EDUCATION_LEVELS
df['EDUCATION_LEVELS'] = df['EDUCATION_LEVELS'].apply(parse_education_levels)

# Drop rows with missing values in DURATION or EDUCATION_LEVELS
df = df.dropna(subset=['DURATION', 'EDUCATION_LEVELS'])

# Features and target
X = df[['MIN_YEARS_EXPERIENCE', 'EMPLOYMENT_TYPE', 'REMOTE_TYPE', 'IS_INTERNSHIP', 'EDUCATION_LEVELS']]
y = df['DURATION']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)
print("Sample of preprocessed EDUCATION_LEVELS:", df['EDUCATION_LEVELS'].head().tolist())

# Model Training

With the data preprocessed, I trained a Random Forest Regressor, a robust model that handles numerical features well and is less prone to overfitting than a single decision tree. The model was trained on the training set with 100 trees (n_estimators=100) to keep predictions stable. A Random Forest builds many decision trees on bootstrap samples of the training data and averages their predictions, which often performs better than any single tree.
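
As an aside, the averaging behaviour can be checked directly on synthetic data. The sketch below is not part of my pipeline: the data is randomly generated, and the names `X_demo`, `y_demo`, and `forest` are placeholders introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, unrelated to the job postings dataset
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3 * X_demo[:, 0] + rng.normal(0, 1, size=50)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Each fitted tree predicts on its own; stack to shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])

# Averaging over trees should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_prediction))
```

The spread of `per_tree` across trees also gives a rough sense of how uncertain an individual prediction is.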

Here’s the code for training the Random Forest Regressor:

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Initialize and train the Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Model training completed.")

# Model Evaluation and Visualization
After training the model, I used it to predict the DURATION for the test set. To evaluate the model's performance, I calculated the Mean Squared Error (MSE), which measures the average squared difference between actual and predicted values. A lower MSE indicates better predictive accuracy.
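
To make the metric concrete, here is a small hand computation of the MSE next to scikit-learn's `mean_squared_error`. The actual values reuse durations from the sample data, while the predicted values are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual durations from the sample data; predictions are hypothetical
y_true = np.array([6.0, 35.0, 48.0])
y_hat = np.array([10.0, 30.0, 50.0])

# MSE: mean of the squared differences between actual and predicted values
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

# RMSE brings the error back to the original unit (days)
rmse = np.sqrt(mse_manual)

print(f"MSE (manual): {mse_manual:.2f}")
print(f"MSE (sklearn): {mse_sklearn:.2f}")
print(f"RMSE: {rmse:.2f} days")
```

Because the MSE is expressed in squared days, the RMSE is often easier to interpret alongside it.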

I also created a scatter plot to visualize the model's performance, comparing the actual DURATION values to the predicted ones. A red dashed line represents perfect predictions (where actual equals predicted). Points closer to this line indicate better predictions.

Here’s the code for evaluation and visualization:

In [None]:
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Make predictions on the test set
y_pred = rf.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

# Plot actual vs predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue', label='Predicted vs Actual')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Perfect Prediction')
plt.xlabel('Actual Duration (Days)')
plt.ylabel('Predicted Duration (Days)')
plt.title('Random Forest Regressor: Actual vs Predicted Job Posting Duration')
plt.legend()
plt.grid(True)
plt.savefig('actual_vs_predicted_duration.png')

# Results

The Mean Squared Error (MSE) provides a quantitative measure of the model's performance on the test set. Here the MSE was 1296.00: the held-out posting had an actual duration of 6.0 days, while the model predicted 42.0 days. The scatter plot (actual_vs_predicted_duration.png) shows the same result, with the test point lying well away from the perfect prediction line. With such a small dataset, neither the metric nor the plot can show whether the Random Forest Regressor captures general trends.

# Conclusion

Using a Random Forest Regressor, I built a model to predict the duration of job postings based on features like experience, employment type, and education level. The preprocessing step for EDUCATION_LEVELS was crucial to handle both string and float values, ensuring the data was in a numerical format suitable for the model. However, the model's performance was poor, with an MSE of 1296.00 and a significant overprediction (42.0 days predicted vs. 6.0 days actual), as shown in the scatter plot. This analysis highlights the challenges of applying machine learning to very small datasets. Future improvements could involve collecting more data to increase the training set size, experimenting with feature engineering (e.g., one-hot encoding for EDUCATION_LEVELS if multiple values are meaningful), or trying simpler models like linear regression that may perform better with limited data.
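
As a sketch of the one-hot idea mentioned above, the following assumes that EDUCATION_LEVELS can list several codes in the same bracketed format. The multi-value string and the `EDU_` column prefix are hypothetical and were not observed in my sample data.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw values; the second entry assumes a multi-code posting
raw = pd.Series(['[\n 2\n]', '[\n 2,\n 3\n]', '[\n 99\n]'])

# Parse each string into a Python list of integer codes
parsed = raw.apply(ast.literal_eval)

# One indicator column per education code, keeping every code in the list
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(parsed),
    columns=[f"EDU_{code}" for code in mlb.classes_],
    index=raw.index,
)

print(encoded)
```

Unlike taking only the first value of each list, this keeps the information that a posting accepts more than one education level.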


---
title: "NLP Methods"
subtitle: "NLP Analysis: Extracting Required Skills from Job Postings"
author:
  - name: Shreya Mani
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
format: 
  html:
    toc: true
    number-sections: true
    df-print: paged
---

# Introduction
In this project, we used Natural Language Processing (NLP) to extract required skills from job postings based on their description text in the BODY column. The dataset contains job postings with unstructured text descriptions, which often mention skills needed for the role (e.g., "analyze data" or "develop software"). Our goal was to identify and analyze the most common skills mentioned in these postings, providing insights into the skills most in demand. This analysis can help job seekers understand the key skills to develop and assist employers in identifying trends in skill requirements.

# Data Preprocessing
We started by loading the dataset and focusing on the BODY column, which contains the job description text. The BODY text is unstructured and requires preprocessing for NLP analysis. We performed the following steps:

- **Tokenization and Cleaning**: Converted the text to lowercase, removed punctuation, and tokenized the text into words.
- **Stop Word Removal**: Removed common stop words (e.g., "the", "is") that don’t add meaningful information. We used a predefined list of common stop words to avoid external dependencies.
- **Skill Extraction**: Defined a list of common skills relevant to job postings (e.g., "data analysis," "software development") and searched for these skills in the cleaned text. For simplicity, we used keyword matching to identify skills, but this could be extended with more advanced NLP techniques like named entity recognition (NER) or pre-trained models.

Since this task is exploratory and doesn’t require a target variable, we didn’t split the data into training and testing sets. Instead, we processed all available job descriptions to extract and analyze skills.
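
One caveat with plain keyword matching is that a short skill name can match inside a longer word. The sketch below contrasts substring matching with a word-boundary regular expression; it is an illustrative alternative rather than the code we ran, and `match_skills`, `skills_demo`, and `text_demo` are hypothetical names.

```python
import re

# Hypothetical skill list and cleaned description
skills_demo = ["sql", "python", "data analysis"]
text_demo = "experience with nosql databases and python for data analysis"


def match_skills(text, skills):
    """Return the skills that occur as whole words or phrases in text."""
    found = []
    for skill in skills:
        # \b anchors the match at word boundaries on both sides
        pattern = r"\b" + re.escape(skill) + r"\b"
        if re.search(pattern, text):
            found.append(skill)
    return found


# Plain substring test: "sql" also matches inside "nosql"
substring_hits = [skill for skill in skills_demo if skill in text_demo]

# Word-boundary test: only whole-word occurrences count
boundary_hits = match_skills(text_demo, skills_demo)

print("Substring matching:", substring_hits)
print("Word-boundary matching:", boundary_hits)
```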

Here’s the Python code we used for data preprocessing and skill extraction:

In [None]:
#| label: data-cleaning
#| echo: true
#| warning: false

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import ast

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Auto-download CSV if missing
csv_path = 'region_analysis/lightcast_job_postings.csv'
if not os.path.exists(csv_path):
    print(f"{csv_path} not found! Attempting to download...")

    os.makedirs('region_analysis', exist_ok=True)

    try:
        import gdown
    except ImportError:
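        # Install gdown with the interpreter that runs this notebook
        import subprocess
        import sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "gdown"])
        import gdown

    file_id = '1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ'  # Google Drive file ID of the dataset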
    url = f'https://drive.google.com/uc?id={file_id}'
    gdown.download(url, csv_path, quiet=False)
    print("Download complete!")
else:
    print(f"{csv_path} found. Proceeding...")

# Load the dataset
df = pd.read_csv('region_analysis/lightcast_job_postings.csv')
df.head()
df.tail()

In [None]:
import pandas as pd
import numpy as np
import re
from collections import Counter
import matplotlib.pyplot as plt

# Load the dataset (using the provided sample data)
data = {
    'BODY': [
        "Enterprise Analyst (II-III)\n\nRemote work available\nAnalyze data and provide insights.",
        "Software Engineer\n\nOn-site position in New York\nDevelop and maintain software systems.",
        np.nan,
        "Data Scientist\n\nHybrid role\nWork with large datasets to build models."
    ],
    'REMOTE_TYPE': [1.0, 0.0, np.nan, 0.0]
}
df = pd.DataFrame(data)

# Drop rows with missing values in BODY
df = df.dropna(subset=['BODY'])

# Define a list of common stop words (hardcoded to avoid any external dependency)
stop_words = {
    'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he',
    'in', 'is', 'it', 'its', 'of', 'on', 'that', 'the', 'to', 'was', 'were', 'will',
    'with', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
    'yours', 'yourself', 'yourselves', 'him', 'his', 'her', 'hers', 'herself', 'it',
    'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which',
    'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were',
    'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing',
    'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of',
    'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during',
    'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
    'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when',
    'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
    'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
    's', 't', 'can', 'will', 'just', 'don', 'should', 'now'
}

# Function to clean and preprocess text
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace punctuation and special characters with spaces so that
    # hyphenated terms such as "on-site" do not fuse into a single token
    text = re.sub(r'[^a-z\s]', ' ', text)
    # Tokenize and remove stop words
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    # Join tokens back into a string
    return ' '.join(tokens)

# Apply text cleaning
df['BODY_CLEANED'] = df['BODY'].apply(clean_text)
print("Cleaned Text Sample:")
print(df[['BODY', 'BODY_CLEANED']].head())

# Define a list of common skills to search for
skills_list = [
    "data analysis", "software development", "machine learning",
    "project management", "communication", "teamwork",
    "sql", "python", "modeling", "analytics"
]

# Function to extract skills from text
def extract_skills(text):
    found_skills = []
    for skill in skills_list:
        # Check if the skill (or a variation) is in the text
        if skill in text:
            found_skills.append(skill)
        # Handle multi-word skills by checking individual words
        elif all(word in text.split() for word in skill.split()):
            found_skills.append(skill)
    return found_skills

# Apply skill extraction
df['SKILLS'] = df['BODY_CLEANED'].apply(extract_skills)

# Flatten the list of skills and count their frequency
all_skills = [skill for sublist in df['SKILLS'] for skill in sublist]
skill_counts = Counter(all_skills)

print("Extracted skills and their frequencies:", dict(skill_counts))
print("Sample of cleaned text and extracted skills:")
for i in range(len(df)):
    print(f"Job {i+1}: {df['BODY_CLEANED'].iloc[i]} -> Skills: {df['SKILLS'].iloc[i]}")

# Visualize the most common skills, sorted by frequency
plt.figure(figsize=(10, 6))
if skill_counts:
    skills, counts = zip(*sorted(skill_counts.items(), key=lambda x: x[1], reverse=True))
    plt.bar(skills, counts, color='skyblue')
    plt.xlabel('Skills')
    plt.ylabel('Frequency')
    plt.title('Most Common Skills in Job Postings')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('skills_frequency.png')
    plt.show()
else:
    print("No skills were found in the job descriptions.")

# Skill Analysis and Visualization

After extracting skills from the job descriptions, we analyzed their frequency to identify the most common skills mentioned. We visualized the results using a bar plot, showing the count of each skill across all job postings. This helps highlight the skills that are most in demand based on the dataset.
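
The frequency analysis itself reduces to counting and ranking. As a minimal sketch with made-up extraction results (the list `skills_per_posting` is hypothetical), `Counter.most_common` returns skills ordered by count, and dividing by the number of postings gives the share of postings that mention each skill:

```python
from collections import Counter

# Hypothetical extraction results: one list of skills per posting
skills_per_posting = [["python", "sql"], ["python"], []]

# Count how often each skill appears across all postings
counts_demo = Counter(
    skill for posting in skills_per_posting for skill in posting
)

# Rank skills from most to least frequent
ranked = counts_demo.most_common()

# Share of postings mentioning each skill (each skill appears once per posting here)
share = {skill: count / len(skills_per_posting) for skill, count in ranked}

print(ranked)
print(share)
```

Reporting shares alongside raw counts makes results comparable across samples of different sizes.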

Here’s the code we used for analyzing and visualizing the skills:

In [None]:
# Plot the most common skills
plt.figure(figsize=(10, 6))
skills, counts = zip(*skill_counts.items()) if skill_counts else ([], [])
if counts:  # Ensure there are skills to plot
    plt.bar(skills, counts, color='skyblue')
    plt.xlabel('Skills')
    plt.ylabel('Frequency')
    plt.title('Most Common Skills in Job Postings')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('skills_frequency.png')
else:
    print("No skills were found in the job descriptions.")

---
title: "References"
---

## Predictive Salary Modelling: Leveraging Data Science Skills and Machine Learning for Accurate Forecasting
Haseeb, M. A., Viswanathan, R., Iyer, K., Hota, A. R., & Prathaban, B. P. (2024). Predictive salary modelling: Leveraging data science skills and machine learning for accurate forecasting. 2024 9th International Conference on Communication and Electronics Systems (ICCES), 1011–1019. https://ieeexplore.ieee.org/document/10859447

## Tackling Economic Inequalities Through Business Analytics: A Literature Review
Adaga, E. M., Egieya, Z. E., Ewuga, S. K., Abdul, A. A., & Abrahams, O. (2024). Tackling economic inequalities through business analytics: A literature review. Computer Science & IT Research Journal, 5(1), 60–80. https://doi.org/10.51594/csitjr.v5i1.702

## Antecedent Configurations Toward Supply Chain Resilience: The Joint Impact of Supply Chain Integration and Big Data Analytics Capability
Jiang, Y., Feng, T., & Huang, Y. (2024). Antecedent configurations toward supply chain resilience: The joint impact of supply chain integration and big data analytics capability. Journal of Operations Management, 70(2), 257–284. https://doi.org/10.1002/joom.1282

## Leveraging AI and Data Analytics for Enhancing Financial Inclusion in Developing Economies
Adeoye, O. B., Addy, W. A., Ajayi-Nifise, A. O., Odeyemi, O., Okoye, C. C., & Ofodile, O. C. (n.d.). Leveraging AI and data analytics for enhancing financial inclusion in developing economies. Finance & Accounting Research Journal. https://fepbl.com/index.php/farj/article/view/856

## Employee Career Decision Making: The Influence of Salary and Benefits, Work Environment and Job Security
Achim, N., Badrolhisam, N. I., & Zulkipli, N. (2019). Employee career decision making: The influence of salary and benefits, work environment and job security. Journal of Academia, 7(Special Issue 1), 41–50.

## The Influences of Salaries and “Opportunity Costs” on Teachers’ Career Choices
Murnane, R. J., Singer, J. D., & Willett, J. B. (1989). The influences of salaries and “opportunity costs” on teachers' career choices: Evidence from North Carolina. Harvard Educational Review, 59(3), 325–349.

## The Future of Work: Impacts of AI on Employment and Job Market Dynamics
Tomar, A., Sharma, S., Arti, & Suman, S. (2024). The future of work: Impacts of AI on employment and job market dynamics. 2024 International Conference on Progressive Innovations in Intelligent Systems and Data Science (ICPIDS).

## AI and Job Market: Analysing the Potential Impact of AI on Employment, Skills, and Job Displacement
Faluyi, S. E. (2025). AI and job market: Analysing the potential impact of AI on employment, skills, and job displacement. African Journal of Marketing Management, 17(1), 1–8.
