---
date: "Monday June 16th, 2025"
---

# Diagnostic Analysis: AI Skills Premium Investigation

Building upon our descriptive analysis of the AI job market landscape, we now shift to diagnostic analytics to investigate why certain patterns exist in compensation across the AI field. This second phase focuses on identifying the factors associated with salary variations, with particular emphasis on quantifying skill premiums and understanding how they differ across experience levels. By leveraging more sophisticated analytical techniques, we move beyond observing what is happening toward explaining why these patterns occur.

Our analysis has addressed two key questions so far: (1) which specific skills or skill combinations command the highest salary premiums, and (2) how these premiums vary across experience levels, from entry to executive positions. To answer these questions, we've performed skills tokenization and frequency analysis to identify high-value technical capabilities, calculated salary differentials for both individual skills and skill combinations, and examined how the value of skills changes throughout career progression.

This diagnostic phase required sophisticated data manipulation, including parsing the required_skills column into individual skill flags, standardizing skill terminology, creating derived features to capture skill domain coverage, and analyzing co-occurrence patterns. The insights generated provide STEAMe's platform users with actionable intelligence about which skills deliver the highest return on investment at each career stage, helping learners make more strategic decisions about their professional development.

Before beginning this analysis, we'll need to ensure our dataset is properly prepared from the descriptive phase, with particular attention to handling the required_skills text field, normalizing education_required categories, and ensuring our numerical variables are appropriately scaled for regression analysis.

## 1. Data Preparation and Enhancement

In this diagnostic analysis phase, we'll build upon our descriptive findings to investigate why certain patterns exist in AI job compensation. Before diving into analytical questions, we need to prepare and enhance our dataset, particularly by transforming unstructured skills data into a format suitable for quantitative analysis. This preparation will allow us to isolate the impact of specific skills, educational requirements, and other factors on salary levels.

The first critical step involves parsing the `required_skills` column, which currently contains comma-separated text listings, into individual skill indicators. This transformation will enable us to measure the frequency and salary premium associated with each skill across the AI job market.

In [64]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import re
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression
import statsmodels.api as sm
from typing import List

# Set visualization styles
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_palette("colorblind")

# Load the dataset from previous phase
full_df = pd.read_csv('data/ai_job_dataset_united_states.csv')

# Display the first few rows to remind ourselves of the structure
print(f"Dataset Shape: {full_df.shape}")
print("\nFirst 5 rows of the dataset:")
display(full_df.head())

# Check the required_skills column format
print("\nSample of required_skills values:")
display(full_df["required_skills"].head(10))

Dataset Shape: (724, 19)

First 5 rows of the dataset:


| | job_id | job_title | salary_usd | salary_currency | experience_level | employment_type | company_location | company_size | employee_residence | remote_ratio | required_skills | education_required | years_experience | industry | posting_date | application_deadline | job_description_length | benefits_score | company_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AI00022 | Autonomous Systems Engineer | 102550 | USD | MI | PT | United States | M | United States | 0 | Tableau, Spark, NLP, TensorFlow, PyTorch | Bachelor | 2 | Automotive | 4/23/2024 | 6/23/2024 | 625 | 10.0 | Cognitive Computing |
| 1 | AI00041 | Data Scientist | 96956 | USD | MI | FT | United States | M | China | 0 | Data Visualization, Azure, Spark, MLOps | Bachelor | 2 | Telecommunications | 2/24/2024 | 3/9/2024 | 761 | 5.3 | DataVision Ltd |
| 2 | AI00042 | AI Architect | 196954 | USD | SE | FL | United States | L | United States | 50 | Java, Mathematics, SQL | Bachelor | 8 | Finance | 3/17/2025 | 4/16/2025 | 2290 | 7.5 | DataVision Ltd |
| 3 | AI00046 | AI Research Scientist | 174663 | USD | SE | CT | United States | M | Singapore | 50 | Data Visualization, Statistics, R | Associate | 7 | Media | 1/10/2024 | 3/22/2024 | 1151 | 5.4 | DeepTech Ventures |
| 4 | AI00053 | Research Scientist | 106579 | USD | MI | PT | United States | S | United States | 100 | Docker, Tableau, Mathematics | Associate | 3 | Gaming | 2/24/2024 | 5/2/2024 | 2028 | 5.4 | Cognitive Computing |



Sample of required_skills values:


0       Tableau, Spark, NLP, TensorFlow, PyTorch
1        Data Visualization, Azure, Spark, MLOps
2                         Java, Mathematics, SQL
3              Data Visualization, Statistics, R
4                   Docker, Tableau, Mathematics
5                            PyTorch, Linux, SQL
6                               R, NLP, SQL, GCP
7                      Java, Hadoop, Mathematics
8    AWS, Kubernetes, Docker, Mathematics, MLOps
9              Computer Vision, Hadoop, Git, AWS
Name: required_skills, dtype: object

Now that we've loaded our dataset, we'll parse the `required_skills` column into individual skill indicators. This involves:

1. Extracting unique skills from across all job postings
2. Creating binary columns indicating whether each job requires a particular skill
3. Standardizing skill names to account for variations in terminology

In [65]:
# Function to extract and standardize skills
def extract_skills(skills_text: str) -> List[str]:
    if pd.isna(skills_text):
        return []
    
    # Split by comma and strip whitespace
    skills = [s.strip() for s in skills_text.split(',')]

    # Standardize common variations
    skill_mapping = {
        'tensorflow': 'TensorFlow',
        'pytorch': 'PyTorch',
        'scikit-learn': 'Scikit-learn',
        'sklearn': 'Scikit-learn',
        'machine learning': 'Machine Learning',
        'deep learning': 'Deep Learning',
        'natural language processing': 'NLP',
        'computer vision': 'Computer Vision',
        'statistics': 'Statistics',
        'python': 'Python',
        'r': 'R',
        'sql': 'SQL',
        'java': 'Java',
        'c++': 'C++',
        'aws': 'AWS',
        'azure': 'Azure',
        'gcp': 'GCP',
        'google cloud': 'GCP',
        'docker': 'Docker',
        'kubernetes': 'Kubernetes',
        'hadoop': 'Hadoop',
        'spark': 'Spark',
        'tableau': 'Tableau',
        'power bi': 'Power BI',
        'git': 'Git',
        'linux': 'Linux',
        'mathematics': 'Mathematics',
        'mlops': 'MLOps',
    }

    # Standardize skills (case-insensitive mapping)
    standardize_skills = []
    for skill in skills:
        skill_lower = skill.lower()
        if skill_lower in skill_mapping:
            standardize_skills.append(skill_mapping[skill_lower])
        else:
            standardize_skills.append(skill)
    return standardize_skills

In [66]:
# Extract all skills from the dataset
all_skills = []
for skills_text in full_df["required_skills"]:
    skills = extract_skills(skills_text)
    all_skills.extend(skills)

# Count frequency of each skill
skill_counts = pd.Series(all_skills).value_counts()

# Keep only skills that appear in at least 1% of job postings
min_count = len(full_df) * 0.01
common_skills = skill_counts[skill_counts >= min_count].index.tolist()

print(f"Number of common skills: {len(common_skills)}")
print("\nTop 20 most common skills:")
display(skill_counts.head(20))

Number of common skills: 24

Top 20 most common skills:


Python                218
SQL                   167
Kubernetes            155
TensorFlow            154
Scala                 149
Linux                 141
PyTorch               132
Java                  129
Mathematics           120
Computer Vision       118
R                     116
Git                   113
MLOps                 111
Azure                 111
Hadoop                110
GCP                   109
NLP                   105
Tableau               102
Spark                  99
Data Visualization     96
Name: count, dtype: int64

### Creating Individual Skill Flags

In the next cells, we'll parse the `required_skills` column into individual skill flags. We create binary indicator variables for the common skills, relying on `extract_skills` to handle variations in skill naming and formatting.

In [67]:
# Parse each posting's skills once, then create a binary column for each common skill
parsed_skills = full_df["required_skills"].apply(extract_skills)
for skill in common_skills:
    full_df[f"skill_{skill}"] = parsed_skills.apply(
        lambda skills, skill=skill: int(skill in skills)
    )

# Show the first few rows with the new skill columns
skill_cols = [col for col in full_df.columns if col.startswith("skill_")]
print(f"\nCreated {len(skill_cols)} skill indicator columns")
display(full_df[["job_id", "job_title", "salary_usd"] + skill_cols[:5]].head())


Created 24 skill indicator columns


| | job_id | job_title | salary_usd | skill_Python | skill_SQL | skill_Kubernetes | skill_TensorFlow | skill_Scala |
|---|---|---|---|---|---|---|---|---|
| 0 | AI00022 | Autonomous Systems Engineer | 102550 | 0 | 0 | 0 | 1 | 0 |
| 1 | AI00041 | Data Scientist | 96956 | 0 | 0 | 0 | 0 | 0 |
| 2 | AI00042 | AI Architect | 196954 | 0 | 1 | 0 | 0 | 0 |
| 3 | AI00046 | AI Research Scientist | 174663 | 0 | 0 | 0 | 0 | 0 |
| 4 | AI00053 | Research Scientist | 106579 | 0 | 0 | 0 | 0 | 0 |
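
As an aside, the same kind of indicator matrix can be built in a vectorized way with pandas' `Series.str.get_dummies`. The sketch below is an illustration on toy data rather than the notebook's method: the helper name `skill_flags` is hypothetical, and it assumes skill names are already standardized, since it performs no case or synonym mapping.

```python
import pandas as pd


def skill_flags(skills: pd.Series, prefix: str = "skill_") -> pd.DataFrame:
    """Return one 0/1 column per distinct skill in a comma-separated Series."""
    # Normalise "," and ", " separators to "|" so get_dummies can split on it.
    cleaned = skills.str.strip().str.replace(r"\s*,\s*", "|", regex=True)
    # Rows with a missing skills string end up with zeros in every column.
    return cleaned.str.get_dummies(sep="|").add_prefix(prefix)


# Toy data (hypothetical, not taken from the dataset)
toy = pd.Series(["Python, SQL", "SQL,Docker", None])
flags = skill_flags(toy)
print(flags)
```

Unlike the loop above, this creates a column for every skill it encounters, so the 1% frequency filter would still need to be applied afterwards.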


In [68]:
# Store the resulting dataframe in a new CSV file
full_df.to_csv("./exports/s2-diagnostic/df_with_skills_pivoted.csv", index=False)

In [11]:
# Create a count of skills required for each job
full_df['skill_count'] = full_df[skill_cols].sum(axis=1)

# Display distribution of skill counts
fig = px.histogram(
    full_df, 
    x='skill_count',
    marginal='box',
    title='Distribution of Number of Skills Required per Job',
    labels={'skill_count': 'Number of Skills'},
    opacity=0.7,
    color_discrete_sequence=['#636EFA']
)
fig.update_layout(
    height=500,
    width=800
)
fig.show()

Counting only the common skills flagged above, every posting requires at least three skills: 245 postings require three skills, 236 require four, and 236 require five.

### Normalizing Educational Requirements

In the AI job market, educational requirements are often specified in various formats but follow common patterns. To enable meaningful analysis of how education impacts compensation, we need to standardize these requirements into consistent categories. This step involves mapping variations in educational terminology to a standardized hierarchy (Associate, Bachelor's, Master's, PhD) and creating appropriate categorical variables for our regression models.


In [14]:
# Examine the current distribution of educational requirements
print("Original education requirement categories:")
display(full_df['education_required'].value_counts())

Original education requirement categories:


education_required
Bachelor     206
Master       183
Associate    175
PhD          160
Name: count, dtype: int64

In [15]:
# Create a standardized mapping for education levels
education_mapping = {
    'Associate': 'Associate',
    'Bachelor': 'Bachelor',
    'Master': 'Master',
    'PhD': 'PhD'
}

# Apply mapping to create standardized education column
full_df['education_standardized'] = full_df['education_required'].map(education_mapping)

# Create education level as an ordinal feature (for regression analysis)
education_order = {
    'Associate': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}

full_df['education_level'] = full_df['education_standardized'].map(education_order)

# Create dummy variables for education (for visualization and categorical analysis)
education_dummies = pd.get_dummies(full_df['education_standardized'], prefix='edu')
full_df = pd.concat([full_df, education_dummies], axis=1)

# Verify the new columns
print("\nEducation distribution after standardization:")
display(full_df['education_standardized'].value_counts())


Education distribution after standardization:


education_standardized
Bachelor     206
Master       183
Associate    175
PhD          160
Name: count, dtype: int64
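
Because `Series.map` with a dictionary returns `NaN` for any value missing from the dictionary, it may be worth checking for unmapped categories if the source data ever changes. One possible sketch, on toy values, uses an ordered categorical instead of two dictionaries. The constant `EDUCATION_ORDER` and the `"Diploma"` value are hypothetical and used only for illustration.

```python
import pandas as pd

EDUCATION_ORDER = ["Associate", "Bachelor", "Master", "PhD"]

# Toy values (hypothetical); "Diploma" stands in for an unexpected category.
education = pd.Series(["Bachelor", "Associate", "PhD", "Master", "Diploma"])

# Values outside EDUCATION_ORDER become missing rather than raising an error.
education_cat = pd.Series(
    pd.Categorical(education, categories=EDUCATION_ORDER, ordered=True)
)
education_level = education_cat.cat.codes  # -1 marks unmapped values

print(education_level.tolist())
print("Unmapped rows:", int(education_cat.isna().sum()))
```

The integer codes follow the same 0 to 3 ordering as `education_order`, and the `-1` code makes unexpected categories easy to spot before they reach a regression model.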

In [16]:
# Visualize the salary distribution by education requirement
fig = px.box(
    full_df, 
    x='education_standardized', 
    y='salary_usd',
    color='education_standardized',
    title='Salary Distribution by Education Requirement',
    labels={'salary_usd': 'Salary (USD)', 'education_standardized': 'Education Required'},
    category_orders={'education_standardized': ['Associate', 'Bachelor', 'Master', 'PhD']},
    height=500,
    width=800
)
fig.update_layout(showlegend=False)
fig.show()

### Creating Derived Features

To better understand the complex relationships in the AI job market, we need to go beyond individual variables and explore how they interact. Derived features can capture these relationships, such as how experience level might modify the impact of education or how combinations of skills might yield higher premiums than individual skills alone. These engineered features often provide deeper insights than the original variables and can improve the performance of our regression models.
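
To make concrete what an interaction feature captures, consider an illustrative linear specification. This is an assumed form for exposition only (the regression itself is not shown in this section); here $E$ stands for the ordinal education level and $X$ for years of experience.

```latex
\text{salary} = \beta_0 + \beta_1 E + \beta_2 X + \beta_3 (E \times X) + \varepsilon
\quad\Longrightarrow\quad
\frac{\partial\,\text{salary}}{\partial E} = \beta_1 + \beta_3 X
```

Under this form, the salary effect of one additional education level is no longer a constant $\beta_1$: it grows or shrinks with experience depending on the sign of $\beta_3$, which is the coefficient on the product term.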

In [35]:
# Create an interaction term between education level and years of experience
full_df['exp_edu_interaction'] = full_df['education_level'] * full_df['years_experience']

# Create skill combination features based on co-occurrence
print("Creating skill combinations based on co-occurrence:")
# Find pairs of skills that frequently appear together
skill_cooccurrence = {}
for i, row in full_df.iterrows():
    skills_present = [col[6:] for col in skill_cols if row[col] == 1]
    for s1 in range(len(skills_present)):
        for s2 in range(s1+1, len(skills_present)):
            pair = (skills_present[s1], skills_present[s2])
            skill_cooccurrence[pair] = skill_cooccurrence.get(pair, 0) + 1

# Convert to list and sort by frequency
skill_pairs = [(s1, s2, count) for (s1, s2), count in skill_cooccurrence.items()]
skill_pairs.sort(key=lambda x: x[2], reverse=True)

# Print the top co-occurring skill pairs for verification
print("\nTop 10 co-occurring skill pairs:")
for s1, s2, count in skill_pairs[:10]:
    print(f"{s1} & {s2}: {count} occurrences")

# Create features for top co-occurring pairs
for s1, s2, count in skill_pairs[:10]:
    combo_name = f"combo_{s1}_{s2}"
    col1 = f"skill_{s1}"
    col2 = f"skill_{s2}"
    if col1 in full_df.columns and col2 in full_df.columns:
        full_df[combo_name] = full_df[col1] & full_df[col2]
        combo_count = full_df[combo_name].sum()
        avg_salary = full_df.loc[full_df[combo_name] == 1, 'salary_usd'].mean()
        salary_premium = avg_salary - full_df['salary_usd'].mean()
        print(f"{combo_name}: {combo_count} jobs, Avg salary: ${avg_salary:.2f}, Premium: ${salary_premium:.2f}")

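# Create bins for years of experience to capture non-linear effects
# include_lowest=True keeps postings with 0 years of experience in the '0-2' bin
# (pd.cut intervals are right-inclusive, so 0 would otherwise become NaN)
full_df['experience_bin'] = pd.cut(
    full_df['years_experience'],
    bins=[0, 2, 5, 10, 20],
    labels=['0-2', '3-5', '6-10', '11+'],
    include_lowest=True
)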

# Create features combining company size and experience level
full_df['size_experience'] = full_df['company_size'] + '_' + full_df['experience_level']

# Create a feature for remote work preference (binary indicator for 100% remote)
full_df['fully_remote'] = (full_df['remote_ratio'] == 100).astype(int)

# Create a feature for job description complexity (might correlate with role sophistication)
full_df['description_length_scaled'] = (full_df['job_description_length'] - full_df['job_description_length'].mean()) / full_df['job_description_length'].std()

# Create a feature for high-paying industries (based on our descriptive analysis)
high_paying_industries = ['Finance', 'Gaming', 'Retail', 'Technology', 'Government']
full_df['high_paying_industry'] = full_df['industry'].isin(high_paying_industries).astype(int)

# Create a feature for skill versatility (having skills across multiple domains)
domains = {
    'programming': ['Python', 'Java', 'C++', 'JavaScript', 'Scala', 'R'],
    'ml_frameworks': ['TensorFlow', 'PyTorch', 'Scikit-learn'],
    'big_data': ['Spark', 'Hadoop', 'SQL'],
    'cloud': ['AWS', 'Azure', 'GCP'],
    'visualization': ['Tableau', 'Power BI'],
    'containerization': ['Docker', 'Kubernetes']
}

for domain, skills in domains.items():
    domain_cols = [f'skill_{skill}' for skill in skills if f'skill_{skill}' in full_df.columns]
    if domain_cols:
        full_df[f'has_{domain}_skills'] = (full_df[domain_cols].sum(axis=1) > 0).astype(int)

full_df['skill_domains_count'] = sum(full_df[f'has_{domain}_skills'] for domain in domains if f'has_{domain}_skills' in full_df.columns)

Creating skill combinations based on co-occurrence:

Top 10 co-occurring skill pairs:
Python & TensorFlow: 61 occurrences
Python & Linux: 40 occurrences
Python & Mathematics: 36 occurrences
Python & Kubernetes: 35 occurrences
Python & Spark: 34 occurrences
Python & MLOps: 34 occurrences
SQL & PyTorch: 32 occurrences
Python & Computer Vision: 32 occurrences
Kubernetes & Scala: 32 occurrences
TensorFlow & PyTorch: 31 occurrences
combo_Python_TensorFlow: 61 jobs, Avg salary: $143086.77, Premium: $-3746.28
combo_Python_Linux: 40 jobs, Avg salary: $138655.42, Premium: $-8177.62
combo_Python_Mathematics: 36 jobs, Avg salary: $146103.03, Premium: $-730.02
combo_Python_Kubernetes: 35 jobs, Avg salary: $144100.80, Premium: $-2732.25
combo_Python_Spark: 34 jobs, Avg salary: $143632.59, Premium: $-3200.46
combo_Python_MLOps: 34 jobs, Avg salary: $128196.00, Premium: $-18637.05
combo_SQL_PyTorch: 32 jobs, Avg salary: $155594.00, Premium: $8760.95
combo_Python_Computer Vision: 32 jobs, Avg salary: 

## Skills Combination Analysis: Key Insights

The analysis of co-occurring skill pairs in AI job postings reveals several important patterns that help explain salary variations in the market:

1. **Python's Dominance**: Python appears in 7 out of 10 top skill combinations, confirming its central role in the AI ecosystem. However, Python alone doesn't necessarily command a salary premium, as most Python combinations show negative salary premiums compared to the overall average.

2. **Value of Specialized Technical Combinations**: The highest salary premium is observed in the Python & Computer Vision combination ($17,219 above average), suggesting that specialization in visual AI applications is particularly valuable. This insight could guide STEAMe's learning pathways for those seeking higher compensation.

3. **Data Processing Expertise**: The SQL & PyTorch combination shows a significant premium ($8,761), indicating that professionals who can both manage data and implement deep learning models are in high demand.

4. **Framework Versatility**: The TensorFlow & PyTorch combination commands a modest premium ($3,921), reflecting the value employers place on flexibility across major deep learning frameworks rather than expertise in just one.

5. **Infrastructure Knowledge**: The Kubernetes & Scala combination shows a positive premium ($3,223), highlighting the value of skills that bridge big data processing and containerized deployment environments.

6. **Surprising MLOps Gap**: Despite industry buzz around MLOps, the Python & MLOps combination shows the largest negative premium (-$18,637), suggesting that this particular skill pairing may be more common in entry-level or lower-paying roles.

7. **Compensation Complexity**: The presence of both positive and negative premiums indicates that skill value is contextual—combinations must be strategic rather than random to maximize earning potential.

These findings provide valuable diagnostic insights for STEAMe's platform, helping to explain why certain AI professionals command higher salaries and enabling more targeted career guidance. The data suggests that rather than accumulating random skills, job seekers should focus on strategic combinations that complement each other and align with higher-paying specializations like computer vision or advanced data processing.
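
For readers who want to reproduce the pair counts behind these combinations, the counting step can also be sketched with the standard library alone. This is a minimal illustration on toy postings, not the notebook's code; the `postings` list is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy postings (hypothetical); each entry is the parsed skill list for one job.
postings = [
    ["Python", "TensorFlow", "SQL"],
    ["Python", "TensorFlow"],
    ["SQL", "PyTorch"],
]

pair_counts = Counter()
for skills in postings:
    # Sort and de-duplicate so each unordered pair always maps to the same key.
    pair_counts.update(combinations(sorted(set(skills)), 2))

for (s1, s2), count in pair_counts.most_common(3):
    print(f"{s1} & {s2}: {count} occurrences")
```

Sorting each skill list before pairing guarantees that a pair is counted under a single key regardless of the order in which the skills were listed.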

## Distribution of Skill Domain Coverage

In [37]:
# Visualize the distribution of skill domains count
fig = px.histogram(
    full_df, 
    x='skill_domains_count',
    title='Distribution of Skill Domain Coverage',
    labels={'skill_domains_count': 'Number of Skill Domains Covered'},
    color_discrete_sequence=['#636EFA']
)
fig.update_layout(
    height=500,
    width=800
)
fig.show()

The histogram of skill domain coverage reveals important patterns in how AI job requirements span across different technical areas. The distribution shows that most AI positions require skills from 2-3 domains (324 and 231 jobs respectively), with fewer positions at the extremes of the spectrum.

This bell-shaped distribution with a slight right skew indicates that employers typically seek candidates with multi-domain expertise rather than hyper-specialists in a single domain. The relatively small number of positions requiring only one domain (113 jobs) suggests that narrow technical focus may limit job opportunities in the AI field. Similarly, the small proportion of jobs requiring skills across 4 domains (48 jobs) indicates that while breadth is valued, expectations of mastery across the entire technical spectrum remain uncommon.

For STEAMe's workforce development mission, this insight is particularly valuable—it suggests that training programs should encourage learners to develop complementary skills across 2-3 domains (such as programming languages, ML frameworks, and cloud technologies) rather than concentrating exclusively on a single specialty. This balanced approach to skill development appears to align with the actual structure of job requirements in the AI market, potentially increasing employability for platform users.

## Salary vs. Skill Domain Coverage

Next we take the skill domain counts and compute the average salary for each number of skill domains covered.

In [33]:
domain_salary_data = []
for domain_count in sorted(full_df['skill_domains_count'].unique()):
    avg_salary = full_df[full_df['skill_domains_count'] == domain_count]['salary_usd'].mean()
    domain_salary_data.append({
        'Domains Covered': domain_count,
        'Average Salary': avg_salary
    })

domain_salary_df = pd.DataFrame(domain_salary_data)
print("\nAverage salary by number of skill domains covered:")
display(domain_salary_df)


Average salary by number of skill domains covered:


| | Domains Covered | Average Salary |
|---|---|---|
| 0 | 0 | 174905.0 |
| 1 | 1 | 153556.070796 |
| 2 | 2 | 144123.975309 |
| 3 | 3 | 144454.147186 |
| 4 | 4 | 153764.8125 |
| 5 | 5 | 202470.5 |
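
Averages like these are easier to interpret alongside the number of postings behind each one. A possible sketch, using `groupby` with named aggregation on toy data (the values below are hypothetical, not from the dataset):

```python
import pandas as pd

# Toy data (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    "skill_domains_count": [0, 2, 2, 2, 3, 3, 5],
    "salary_usd": [170_000, 140_000, 150_000, 145_000, 150_000, 140_000, 200_000],
})

# One row per domain count, with the group size next to the salary statistics.
summary = (
    toy.groupby("skill_domains_count")["salary_usd"]
    .agg(jobs="count", mean_salary="mean", median_salary="median")
    .reset_index()
)
print(summary)
```

Reporting `jobs` next to the mean makes it clear when an average rests on only a handful of postings, and the median gives a check against a few very high salaries dominating a small group.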


In [34]:
# Visualize the relationship between skill domain coverage and salary
fig = px.bar(
    domain_salary_df,
    x='Domains Covered',
    y='Average Salary',
    title='Average Salary by Skill Domain Coverage',
    labels={'Domains Covered': 'Number of Skill Domains Covered', 'Average Salary': 'Average Salary (USD)'},
    color='Average Salary',
    color_continuous_scale='Viridis'
)
fig.update_layout(
    height=500,
    width=800
)
fig.show()

The relationship between skill domain coverage and compensation shows a U-shaped pattern rather than the linear relationship we might expect, in which each additional skill domain leads to higher pay.

Most notably, positions covering five skill domains command the highest average salary ($202,471). Five is the maximum observed here: six domains are defined (programming, ML frameworks, big data, cloud, visualization, and containerization), but no posting covers all six. This suggests that near "full-stack" AI professionals who can work across most of these domains are highly valued but rare. Positions with no classified skill domains also show relatively high compensation ($174,905), potentially representing specialized roles with skill requirements outside our defined domains, or executive positions where technical skills are complemented by leadership abilities. Both extremes rest on very few postings: given the histogram counts above (113, 324, 231, and 48 jobs for one to four domains), the zero- and five-domain groups together account for only 8 of the 724 postings, so their averages should be read with caution.

The middle of the distribution (2-3 domains) shows the lowest average salaries ($144,124 and $144,454 respectively), despite being the most common job configurations, as shown in the histogram above. One possible explanation is an oversupply of professionals with moderate skill breadth, which would reduce the premium for these profiles despite their popularity in job postings.

For STEAMe's workforce development mission, these findings tentatively suggest two paths to maximize earning potential: either develop deep expertise in a specialized niche that falls outside common domain classifications, or invest in broad cross-domain proficiency spanning most major skill areas. The typical approach of developing moderate breadth across 2-3 domains, while providing more job opportunities, appears to yield lower compensation on average. This insight could help learners make more strategic decisions about their skill development pathways based on their career and compensation goals.

## 2. Skills Premium Analysis on Individual Skills

After examining skill combinations, we now focus on individual skills to identify which specific technical capabilities command the highest salary premiums in the AI job market. This analysis will systematically quantify the value of each skill by calculating its frequency in job postings, determining the average salary for positions requiring that skill, and computing the salary differential (premium) it generates. We'll then explore how these premiums vary across experience levels to provide more targeted insights for different career stages. This approach will help STEAMe's users understand which specific skills offer the greatest return on investment for career development and training.

In [39]:
# 1. Calculate frequency of each skill across the dataset
skill_frequency = full_df[skill_cols].sum().sort_values(ascending=False)
skill_frequency = skill_frequency.reset_index()
skill_frequency.columns = ['skill', 'count']
skill_frequency['percentage'] = (skill_frequency['count'] / len(full_df) * 100).round(2)
skill_frequency['skill'] = skill_frequency['skill'].str.replace('skill_', '')

print("Top skills by frequency:")
display(skill_frequency.head(15))

Top skills by frequency:


| | skill | count | percentage |
|---|---|---|---|
| 0 | Python | 218 | 30.11 |
| 1 | SQL | 167 | 23.07 |
| 2 | Kubernetes | 155 | 21.41 |
| 3 | TensorFlow | 154 | 21.27 |
| 4 | Scala | 149 | 20.58 |
| 5 | Linux | 141 | 19.48 |
| 6 | PyTorch | 132 | 18.23 |
| 7 | Java | 129 | 17.82 |
| 8 | Mathematics | 120 | 16.57 |
| 9 | Computer Vision | 118 | 16.3 |


Next, we determine the average salary for jobs requiring each skill and compute skill premium.

In [40]:
skill_salary_data = []

for skill in skill_cols:
    skill_name = skill.replace('skill_', '')
    jobs_with_skill = full_df[full_df[skill] == 1]
    jobs_without_skill = full_df[full_df[skill] == 0]
    
    if len(jobs_with_skill) > 0 and len(jobs_without_skill) > 0:
        avg_salary_with = jobs_with_skill['salary_usd'].mean()
        avg_salary_without = jobs_without_skill['salary_usd'].mean()
        median_salary_with = jobs_with_skill['salary_usd'].median()
        
        skill_premium = avg_salary_with - avg_salary_without
        premium_percentage = (skill_premium / avg_salary_without) * 100
        
        skill_salary_data.append({
            'skill': skill_name,
            'job_count': len(jobs_with_skill),
            'frequency_pct': len(jobs_with_skill) / len(full_df) * 100,
            'avg_salary': avg_salary_with,
            'median_salary': median_salary_with,
            'skill_premium': skill_premium,
            'premium_percentage': premium_percentage
        })

skill_salary_df = pd.DataFrame(skill_salary_data)
skill_salary_df = skill_salary_df.sort_values('skill_premium', ascending=False)

print("\nTop skills by salary premium:")
display(skill_salary_df.head(15))

print("\nBottom skills by salary premium:")
display(skill_salary_df.tail(15))


Top skills by salary premium:


| | skill | job_count | frequency_pct | avg_salary | median_salary | skill_premium | premium_percentage |
|---|---|---|---|---|---|---|---|
| 7 | Java | 129 | 17.81768 | 158923.542636 | 140303.0 | 14711.798098 | 10.201526 |
| 22 | Docker | 86 | 11.878453 | 159384.732558 | 141261.0 | 14243.607166 | 9.813626 |
| 6 | PyTorch | 132 | 18.232044 | 153904.060606 | 133486.5 | 8647.660268 | 5.953376 |
| 11 | Git | 113 | 15.607735 | 152124.752212 | 135799.0 | 6270.369234 | 4.299061 |
| 20 | AWS | 91 | 12.569061 | 151469.087912 | 145616.0 | 5302.519192 | 3.627724 |
| 16 | NLP | 105 | 14.502762 | 151023.761905 | 135101.0 | 4901.580968 | 3.35444 |
| 9 | Computer Vision | 118 | 16.298343 | 150580.194915 | 140365.5 | 4476.792275 | 3.064126 |
| 4 | Scala | 149 | 20.58011 | 149485.020134 | 131033.0 | 3339.181873 | 2.284829 |
| 19 | Data Visualization | 96 | 13.259669 | 149203.854167 | 122234.0 | 2733.225186 | 1.866057 |
| 21 | Deep Learning | 89 | 12.292818 | 147961.853933 | 123058.0 | 1287.019287 | 0.877464 |



Bottom skills by salary premium:


| | skill | job_count | frequency_pct | avg_salary | median_salary | skill_premium | premium_percentage |
|---|---|---|---|---|---|---|---|
| 21 | Deep Learning | 89 | 12.292818 | 147961.853933 | 123058.0 | 1287.019287 | 0.877464 |
| 5 | Linux | 141 | 19.475138 | 147089.06383 | 130960.0 | 317.9369 | 0.216621 |
| 13 | Azure | 111 | 15.331492 | 147025.774775 | 136611.0 | 227.627956 | 0.155062 |
| 18 | Spark | 99 | 13.674033 | 146259.515152 | 128063.0 | -664.377648 | -0.452192 |
| 1 | SQL | 167 | 23.066298 | 146263.137725 | 125079.0 | -740.777895 | -0.503917 |
| 3 | TensorFlow | 154 | 21.270718 | 145335.025974 | 123737.5 | -1902.74771 | -1.292296 |
| 14 | Hadoop | 110 | 15.19337 | 144547.9 | 125250.5 | -2694.536482 | -1.83 |
| 15 | GCP | 109 | 15.055249 | 144404.055046 | 129699.0 | -2859.494548 | -1.941753 |
| 17 | Tableau | 102 | 14.088398 | 143974.078431 | 128438.5 | -3327.80099 | -2.259171 |
| 8 | Mathematics | 120 | 16.574586 | 143564.775 | 123709.0 | -3917.595861 | -2.656315 |
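
To make the `skill_premium` and `premium_percentage` columns explicit, the calculation can be written out as follows. The symbols are our own shorthand rather than names from the notebook: $\bar{y}_{1}$ and $\bar{y}_{0}$ are the mean salaries of postings with and without the skill, $\bar{y}$ is the overall mean salary, and $p$ is the share of postings requiring the skill.

```latex
\text{premium} = \bar{y}_{1} - \bar{y}_{0},
\qquad
\text{premium\_percentage} = 100 \times \frac{\bar{y}_{1} - \bar{y}_{0}}{\bar{y}_{0}}

% Java, using the rounded values from the table above:
\bar{y}_{0} = 158{,}923.54 - 14{,}711.80 = 144{,}211.74,
\qquad
100 \times \frac{14{,}711.80}{144{,}211.74} \approx 10.20\%

% Relation to a premium measured against the overall mean,
% where \bar{y} = p\,\bar{y}_{1} + (1 - p)\,\bar{y}_{0}:
\bar{y}_{1} - \bar{y} = (1 - p)\,(\bar{y}_{1} - \bar{y}_{0})
```

The last identity matters when comparing sections. The skill-combination premiums earlier were measured against the overall mean salary, whereas the individual-skill premiums here are measured against postings without the skill, so the two sets of figures do not share exactly the same baseline.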


In [41]:
# Visualize skills by premium
fig = px.bar(
    skill_salary_df.head(15),
    x='skill',
    y='skill_premium',
    color='skill_premium',
    hover_data=['job_count', 'avg_salary', 'premium_percentage'],
    title='Top 15 Skills by Salary Premium',
    labels={
        'skill': 'Skill',
        'skill_premium': 'Salary Premium (USD)',
        'job_count': 'Number of Jobs',
        'premium_percentage': 'Premium (%)'
    },
    color_continuous_scale='Viridis'
)
fig.update_layout(
    xaxis={'categoryorder': 'total descending'},
    height=500,
    width=900
)
fig.show()

### Top Skills by Salary Premium: Key Insights

The analysis of individual skill premiums shows which technical capabilities are associated with the highest pay in the AI job market. Most notably, Java stands out with the highest salary premium of $14,712 (10.2%), despite not traditionally being regarded as a primary AI language. This suggests that Java's enterprise integration capabilities may be particularly valuable for deploying AI solutions in production environments.

Docker follows closely with a premium of $14,244 (9.8%), highlighting the importance of containerization skills for AI deployment and scalability. The substantial premiums for both Java and Docker indicate that implementation and operationalization skills may be more financially rewarded than pure algorithmic expertise.

Among AI-specific technologies, PyTorch shows the strongest premium at $8,648 (6.0%), outperforming TensorFlow, which shows a negative premium of -$1,903 (-1.3%). This may reflect a market preference for PyTorch's flexibility and research orientation over TensorFlow's more structured approach, although these raw differences do not control for other factors such as seniority or industry.

Version control (Git), cloud infrastructure (AWS), and specialized AI domains (NLP, Computer Vision) all show positive premiums of roughly $4,500 to $6,300, confirming their value in the ecosystem. Meanwhile, general data technologies like SQL and Spark show slight negative premiums, suggesting these have become baseline expectations rather than differentiating skills.

For STEAMe's workforce development initiatives, these findings highlight the importance of including deployment and operationalization skills alongside core AI techniques in training programs, as the highest premiums appear to reward the ability to implement AI solutions in production environments rather than just develop models.
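
Since each premium is a difference between two sample means, and some skills appear in fewer than 100 postings, it can help to attach an uncertainty range before acting on small differences. The sketch below shows one way to do this with a percentile bootstrap in NumPy. It is not part of the original analysis: the helper `bootstrap_premium_ci` is hypothetical, and the salaries are synthetic values generated only for illustration.

```python
import numpy as np


def bootstrap_premium_ci(with_skill, without_skill, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(with_skill) - mean(without_skill)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(with_skill, dtype=float)
    b = np.asarray(without_skill, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement, keeping the original group sizes.
        diffs[i] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


# Synthetic salaries (hypothetical, generated only for illustration)
rng = np.random.default_rng(42)
with_skill = rng.normal(160_000, 50_000, size=120)
without_skill = rng.normal(145_000, 50_000, size=600)

low, high = bootstrap_premium_ci(with_skill, without_skill)
print(f"95% bootstrap interval for the premium: ${low:,.0f} to ${high:,.0f}")
```

An interval that comfortably excludes zero lends more support to a premium than one that straddles it. In the notebook, the two inputs would be the `salary_usd` values of `jobs_with_skill` and `jobs_without_skill`.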

### Skill Premiums by Experience Level

To understand how the value of different skills varies throughout a career, we now examine how salary premiums change across experience levels. The following code analyzes the top 10 skills with the highest overall premiums and calculates their specific premium at each career stage (Entry, Mid-level, Senior, and Executive). This granular analysis will reveal which skills are particularly valuable for professionals at different points in their career trajectory, providing more targeted guidance for STEAMe's workforce development initiatives.

In [42]:
# 5. Analyze how skill premiums vary by experience level
experience_levels = full_df['experience_level'].unique()
skill_premium_by_exp = {}

# Select top 10 skills with highest premium for the analysis
top_skills = skill_salary_df.head(10)['skill'].tolist()

for skill in top_skills:
    skill_col = f'skill_{skill}'
    if skill_col in full_df.columns:
        premium_by_exp = []
        
        for exp in ['EN', 'MI', 'SE', 'EX']:  # Entry, Mid, Senior, Executive
            exp_df = full_df[full_df['experience_level'] == exp]
            
            if len(exp_df) > 0:
                with_skill = exp_df[exp_df[skill_col] == 1]
                without_skill = exp_df[exp_df[skill_col] == 0]
                
                if len(with_skill) > 5 and len(without_skill) > 5:  # Ensure enough data points
                    avg_with = with_skill['salary_usd'].mean()
                    avg_without = without_skill['salary_usd'].mean()
                    premium = avg_with - avg_without
                    
                    premium_by_exp.append({
                        'experience_level': exp,
                        'premium': premium,
                        'avg_with_skill': avg_with,
                        'job_count': len(with_skill)
                    })
        
        if premium_by_exp:
            premium_df = pd.DataFrame(premium_by_exp)
            skill_premium_by_exp[skill] = premium_df

# Create a visualization of premium by experience level for top skills
premium_by_exp_data = []

for skill, premium_df in skill_premium_by_exp.items():
    for _, row in premium_df.iterrows():
        premium_by_exp_data.append({
            'Skill': skill,
            'Experience Level': row['experience_level'],
            'Premium': row['premium'],
            'Job Count': row['job_count']
        })

premium_by_exp_df = pd.DataFrame(premium_by_exp_data)

# Map experience level codes to more readable labels
exp_level_map = {
    'EN': 'Entry',
    'MI': 'Mid-level',
    'SE': 'Senior',
    'EX': 'Executive'
}
premium_by_exp_df['Experience Level'] = premium_by_exp_df['Experience Level'].map(exp_level_map)

In [43]:
# Visualize how premium varies by experience level
fig = px.bar(
    premium_by_exp_df,
    x='Skill',
    y='Premium',
    color='Experience Level',
    barmode='group',
    title='Skill Premium by Experience Level',
    labels={'Premium': 'Salary Premium (USD)'},
    hover_data=['Job Count'],
    height=600,
    width=1000
)
fig.update_layout(xaxis={'categoryorder': 'total descending'})
fig.show()
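
The nested loops in the cell above can also be written as a single groupby. The following is a minimal sketch on a made-up toy frame, not the notebook's actual data: the `premium_by_level` helper and its `min_rows` parameter are hypothetical names, and the toy salaries are invented purely to make the arithmetic easy to follow.

```python
import pandas as pd

# Toy stand-in for full_df; all values are invented for illustration.
toy = pd.DataFrame({
    "experience_level": ["EN"] * 4 + ["SE"] * 4,
    "skill_Docker": [1, 1, 0, 0, 1, 1, 0, 0],
    "salary_usd": [70_000, 74_000, 68_000, 72_000,
                   150_000, 154_000, 140_000, 144_000],
})

def premium_by_level(df, skill_col, min_rows=5):
    """Mean salary with the skill minus mean salary without it, per level.

    Levels where either group has min_rows rows or fewer are dropped,
    mirroring the `> 5` guard used in the loop version.
    """
    grouped = df.groupby(["experience_level", skill_col])["salary_usd"].agg(["mean", "count"])
    means = grouped["mean"].unstack(skill_col)
    counts = grouped["count"].unstack(skill_col).fillna(0)
    enough = (counts[0] > min_rows) & (counts[1] > min_rows)
    return (means[1] - means[0])[enough].rename("premium")

# The toy frame is tiny, so the size guard is relaxed here.
premiums = premium_by_level(toy, "skill_Docker", min_rows=1)
print(premiums)
```

On the real data the call would presumably look like `premium_by_level(full_df, "skill_Docker")`, keeping the default guard so that levels with very few postings are skipped.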

### Skill Premium by Experience Level: Key Insights

The analysis of how skill premiums vary across experience levels reveals striking patterns that challenge the notion of universally valuable skills. Most notably, the value of specific technical capabilities changes dramatically throughout a professional's career path:

At the **Executive level**, certain skills command exceptional premiums—PyTorch ($13,569) and Data Visualization ($13,765) stand out as particularly valuable, suggesting that executives who maintain technical proficiency in these areas are highly sought after. Conversely, Computer Vision shows a substantial negative premium (-$11,883) at this level, indicating it may be perceived as overly specialized for executive roles.

For **Senior positions**, AWS emerges as the only skill with a substantial positive premium ($8,396), highlighting the value of cloud architecture expertise at this career stage. Most other skills show negative premiums, suggesting that by the senior level, specific technical skills become less differentiating than broader competencies.

At **Mid-level**, most skills show modest positive premiums, with Docker ($3,148), Java ($3,100), and PyTorch ($2,676) leading the pack. Interestingly, AWS shows a significant negative premium (-$5,154) at this level despite its value for senior roles.

For **Entry-level** positions, Scala shows the highest premium ($4,580), followed by Docker ($1,664) and Deep Learning ($1,347), while PyTorch (-$3,681) and NLP (-$5,130) actually reduce expected compensation.

These findings suggest that optimal skill development pathways may need to evolve as careers progress, which bears directly on STEAMe's workforce development initiatives. Early-career professionals should prioritize foundational skills like Scala and Docker, mid-career individuals should develop Java and PyTorch expertise, senior professionals should focus on AWS, while executives may benefit from maintaining proficiency in PyTorch and Data Visualization while de-emphasizing Computer Vision. Because each premium is a raw difference in group means, computed whenever both groups contain more than five postings, these recommendations should be treated as provisional until they are checked against a model that controls for other factors.

## 3. Regression Analysis for Salary Drivers

After examining individual skills and their varying premiums across experience levels, we now need a more holistic understanding of salary determinants in the AI job market. While our previous analysis revealed valuable insights about specific skills like PyTorch commanding significant premiums at the executive level or AWS being particularly valuable for senior roles, these bivariate relationships don't account for the complex interplay of multiple factors simultaneously affecting compensation.

Regression analysis allows us to move beyond isolated skill premiums to build a comprehensive model that controls for confounding variables and quantifies the relative importance of different factors. By incorporating experience level, education, industry, company size, skill requirements, and other variables into a unified model, we can identify which factors remain associated with salary differences when the others are held constant. This helps separate each factor's association with salary from that of correlated factors, although a regression on observational data cannot by itself establish causation. The result is a more nuanced understanding of the AI compensation landscape.

For STEAMe's workforce development mission, these regression insights will be particularly valuable. They will help learners understand which factors most significantly impact earning potential, enabling more strategic career planning and skill development. Additionally, the model will quantify the expected return on investment for different educational pathways and skill acquisitions, supporting data-driven decision making for both individuals and training providers.

### Preparation for Regression Analysis

Before building our regression model to identify the key determinants of AI job salaries, we must carefully prepare our dataset. This preparation involves selecting relevant features while avoiding multicollinearity, handling categorical variables appropriately, and addressing any outliers that might skew our results. Since our goal is to create an interpretable model that isolates the impact of various factors on compensation, we'll be particularly careful to include variables from our skill premium analysis while ensuring they don't overlap excessively with other predictors. This groundwork will allow us to build a robust model that provides reliable insights into what truly drives salary differences in the AI job market.

In [45]:
# Import necessary libraries for regression analysis
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

# Start with our feature-enhanced dataset
# Ensure we're working with the most recent version that includes all our derived features
print("Shape of full dataset:", full_df.shape)

# Select potential predictor variables
# Include numerical features
numerical_features = [
    'years_experience',
    'education_level',
    'remote_ratio',
    'job_description_length',
    'benefits_score',
    'skill_count',
    'skill_domains_count'
]

# Select categorical features to include
categorical_features = [
    'experience_level',
    'employment_type',
    'company_size',
    'industry',
    'job_title'
]

# Include top skills from our premium analysis
top_skill_cols = [f"skill_{skill}" for skill in ['Java', 'Docker', 'PyTorch', 'Git', 'AWS', 
                                                'NLP', 'Computer Vision', 'Scala', 
                                                'Data Visualization', 'Deep Learning']]

# Include the top skill combination features
combo_cols = [col for col in full_df.columns if col.startswith('combo_')][:5]  # Top 5 combinations

# Combine all selected features
selected_features = numerical_features + categorical_features + top_skill_cols + combo_cols

# Check if we have any missing values in these columns
missing_values = full_df[selected_features + ['salary_usd']].isnull().sum()
if missing_values.sum() > 0:
    print("Missing values detected:")
    display(missing_values[missing_values > 0])
else:
    print("No missing values in selected features.")

# Create dummy variables for categorical features
model_df = pd.get_dummies(
    full_df[selected_features + ['salary_usd']], 
    columns=categorical_features,
    drop_first=True  # Remove one category per feature to avoid perfect multicollinearity
)

# Check the shape after creating dummies
print(f"Shape after creating dummy variables: {model_df.shape}")

Shape of full dataset: (724, 81)
No missing values in selected features.
Shape after creating dummy variables: (724, 64)


### Feature Selection and Preparation

In this initial data preparation step, we've carefully selected variables from our enhanced dataset to include in the regression model. We've organized predictors into several categories: numerical features (like years of experience and education level), categorical variables (such as experience level and industry), top individual skills that showed high premiums in our earlier analysis, and the most frequent skill combinations. After confirming no missing values exist in our selected features, we've transformed categorical variables into dummy indicators using one-hot encoding with the drop-first approach to avoid perfect multicollinearity. This transformation expanded our feature space considerably, converting our original variables into a regression-ready format where each categorical level becomes its own binary predictor.
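
To make the drop-first encoding concrete, here is a minimal sketch on a made-up four-row frame. The values are illustrative assumptions, not rows from `full_df`; the point is only to show which category becomes the implicit reference level.

```python
import pandas as pd

# Toy frame with one categorical predictor; values are invented.
toy = pd.DataFrame({
    "company_size": ["S", "M", "L", "M"],
    "salary_usd": [90_000, 110_000, 130_000, 115_000],
})

# drop_first=True removes the first category in sorted order ("L"),
# which becomes the baseline that the remaining dummies are compared against.
encoded = pd.get_dummies(toy, columns=["company_size"], drop_first=True)
print(encoded.columns.tolist())
# ['salary_usd', 'company_size_M', 'company_size_S']
```

In a regression, the coefficient on `company_size_M` is then read as the expected salary difference between medium and large companies, with large companies absorbed into the intercept.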

### Addressing Multicollinearity

Before building our regression model, we need to identify and address multicollinearity—a situation where predictor variables are highly correlated with each other. Multicollinearity can destabilize regression coefficients and make it difficult to isolate the true effect of individual variables on salary. To detect this issue, we calculate the Variance Inflation Factor (VIF) for each predictor, which measures how much the variance of a coefficient is inflated due to correlation with other predictors. After converting all variables to numeric format for compatibility with the VIF calculation, we identify and remove features with extreme multicollinearity (VIF > 20), ensuring our final model will provide more reliable and interpretable estimates of each factor's contribution to AI job compensation.

In [48]:
# Check for multicollinearity using Variance Inflation Factor (VIF)
# First, create a dataframe with just the predictor variables
X = model_df.drop('salary_usd', axis=1)
y = model_df['salary_usd']

# Convert all columns to numeric type
X_numeric = X.astype(float)

# Check if conversion was successful
print(f"Data type of X_numeric: {X_numeric.dtypes.unique()}")

# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["Feature"] = X_numeric.columns
vif_data["VIF"] = [variance_inflation_factor(X_numeric.values, i) for i in range(X_numeric.shape[1])]

# Sort by VIF value
vif_data = vif_data.sort_values('VIF', ascending=False)

# Display features with high multicollinearity (VIF > 10)
print("Features with high multicollinearity (VIF > 10):")
display(vif_data[vif_data["VIF"] > 10])

# Remove features with extreme multicollinearity (VIF > 20)
high_vif_features = vif_data[vif_data["VIF"] > 20]["Feature"].tolist()
if high_vif_features:
    print(f"Removing {len(high_vif_features)} features with extreme multicollinearity")
    X = X.drop(columns=high_vif_features)
    # Recalculate X_numeric without high VIF features
    X_numeric = X.astype(float)
else:
    print("No features with extreme multicollinearity detected")

Data type of X_numeric: [dtype('float64')]
Features with high multicollinearity (VIF > 10):


| | Feature | VIF |
|---|---|---|
| 5 | skill_count | 44.858827 |
| 0 | years_experience | 26.273951 |
| 4 | benefits_score | 21.428173 |
| 22 | experience_level_EX | 20.97044 |
| 6 | skill_domains_count | 17.403493 |


Removing 4 features with extreme multicollinearity


### Data Preparation Insights

The multicollinearity analysis flagged five predictors with VIF above 10. skill_count showed the most extreme value (VIF = 44.86), likely because it's mathematically derived from our individual skill indicators. years_experience (VIF = 26.27) and experience_level_EX (VIF = 20.97) also scored high, as they capture related aspects of professional seniority, and benefits_score (VIF = 21.43) exceeded the threshold as well. These four variables, all with VIF above 20, were removed, while skill_domains_count (VIF = 17.40) was retained because it falls below the cutoff. The resulting dataset should produce more stable and interpretable regression coefficients, letting the model better isolate the unique contribution of each factor to AI job salaries without the distortion that comes from redundant predictors.
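
Two caveats about the VIF step are worth checking. First, to our understanding, statsmodels' `variance_inflation_factor` does not add an intercept itself, so calling it on a matrix without a constant column can inflate the reported values. Second, the cell above drops every feature above the threshold in one pass, although removing one collinear feature often lowers the VIF of the others. The sketch below is one possible alternative, not the method used in this notebook: it computes centered VIFs from the diagonal of the inverse correlation matrix (equivalent to $1/(1 - R_j^2)$ with an intercept) and removes one feature at a time. The helper names and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def centered_vif(X: pd.DataFrame) -> pd.Series:
    """Centered VIF per column: the diagonal of the inverse correlation matrix.

    Assumes no constant columns (their correlation is undefined).
    """
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

def prune_by_vif(X: pd.DataFrame, threshold: float = 20.0) -> pd.DataFrame:
    """Drop the single worst feature, recompute, and repeat until all VIFs pass."""
    X = X.copy()
    while X.shape[1] > 1:
        vif = centered_vif(X)
        if vif.max() <= threshold:
            break
        X = X.drop(columns=vif.idxmax())
    return X

# Synthetic example: the third column is almost exactly the sum of the first two.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.01, size=200),
})

kept = prune_by_vif(toy)
print(kept.columns.tolist())
print(centered_vif(kept).round(2))
```

In the synthetic example all three columns start with very large VIFs, yet dropping a single one is enough to bring the remaining two under the threshold, which is the behavior a one-pass removal would miss.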

### Checking for Outliers in the Target Variable

To ensure our regression model isn't unduly influenced by extreme salary values, we'll identify potential outliers using the interquartile range (IQR) method. This statistical technique defines outliers as values falling below Q1-1.5×IQR or above Q3+1.5×IQR, where Q1 and Q3 are the 25th and 75th percentiles of the salary distribution. By flagging these extreme cases, we can assess their potential impact on our model and consider creating alternative models with and without these observations.

In [49]:
# Check for outliers in the target variable
Q1 = y.quantile(0.25)
Q3 = y.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = y[(y < lower_bound) | (y > upper_bound)]
print(f"Number of outliers in salary: {len(outliers)} ({len(outliers)/len(y):.2%} of data)")

Number of outliers in salary: 12 (1.66% of data)


Our analysis identified only 12 salary outliers, representing a modest 1.66% of the dataset. This relatively small proportion suggests that while there are some unusually high or low compensation packages in the AI job market, they don't constitute a major part of our data. Because nothing indicates that these values are data errors rather than legitimate market phenomena, we'll retain them in our primary analysis while remaining mindful of their potential influence on our model coefficients.

### Feature Scaling and Train-Test Split

As a final preparation step before modeling, we'll standardize our numerical features and divide our dataset into training and testing portions. Standardization transforms features to have zero mean and unit variance, which puts coefficients on a comparable per-standard-deviation scale and keeps variables with larger raw scales from dominating regularized models such as Ridge or Lasso. By maintaining both standardized and unstandardized versions of our data, we can choose the most appropriate format for different modeling techniques. The train-test split reserves 20% of our data for model validation, allowing us to assess how well our findings will generalize to new AI job market data.

In [56]:
# Standardize numerical predictors for easier interpretation of coefficients
# We'll keep both standardized and unstandardized versions
scaler = StandardScaler()
numerical_cols = [col for col in X.columns if col in numerical_features]
X_std = X.copy()
X_std[numerical_cols] = scaler.fit_transform(X[numerical_cols])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_std, X_test_std, y_train_std, y_test_std = train_test_split(X_std, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Final check of our prepared data
print("\nFirst few rows of prepared features:")
display(X.head())

Training set shape: (579, 59)
Testing set shape: (145, 59)

First few rows of prepared features:


| | education_level | remote_ratio | job_description_length | skill_domains_count | skill_Java | skill_Docker | skill_PyTorch | skill_Git | skill_AWS | skill_NLP | ... | job_title_Data Scientist | job_title_Deep Learning Engineer | job_title_Head of AI | job_title_ML Ops Engineer | job_title_Machine Learning Engineer | job_title_Machine Learning Researcher | job_title_NLP Engineer | job_title_Principal Data Scientist | job_title_Research Scientist | job_title_Robotics Engineer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 625 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 1 | 0 | 761 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | True | False | False | False | False | False | False | False | False | False |
| 2 | 1 | 50 | 2290 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 0 | 50 | 1151 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 0 | 100 | 2028 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | ... | False | False | False | False | False | False | False | False | True | False |
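
One detail of the scaling cell deserves a second look: `StandardScaler` is fitted on all rows before the split, so the test rows contribute to the means and variances used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that ordering on synthetic data; the toy frame, its column values, and the `numeric_cols` list are assumptions made for the example rather than the notebook's actual objects.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y; values are randomly generated.
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({
    "remote_ratio": rng.choice([0.0, 50.0, 100.0], size=100),
    "job_description_length": rng.integers(500, 2500, size=100).astype(float),
    "skill_Java": rng.integers(0, 2, size=100),
})
toy_y = pd.Series(rng.normal(120_000, 30_000, size=100), name="salary_usd")
numeric_cols = ["remote_ratio", "job_description_length"]

# 1. Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training rows only.
scaler = StandardScaler().fit(X_tr[numeric_cols])

# 3. Apply the same training-set means and variances to both parts.
X_tr_std, X_te_std = X_tr.copy(), X_te.copy()
X_tr_std[numeric_cols] = scaler.transform(X_tr[numeric_cols])
X_te_std[numeric_cols] = scaler.transform(X_te[numeric_cols])

print(X_tr_std[numeric_cols].mean().round(6))
print(X_te_std.shape)
```

With only a handful of numeric columns and a random split the practical difference is likely small, but this ordering keeps the held-out evaluation free of information from the test rows.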


In [54]:
# Save the prepared data for the next steps
regression_data = {
    'X': X,
    'y': y,
    'X_std': X_std,
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test,
    'X_train_std': X_train_std,
    'X_test_std': X_test_std,
}

### Data Preparation Summary

We have successfully prepared a robust dataset for our regression analysis of AI job salary drivers. Through careful feature selection, we identified relevant predictors from our enhanced dataset, including numerical variables, categorical factors, and key skills identified in our premium analysis. We addressed multicollinearity by calculating variance inflation factors and removing highly correlated predictors that could destabilize our model. After identifying a small number of salary outliers, standardizing numerical features for better interpretability, and splitting our data into training and testing sets, we've stored all prepared datasets in a dictionary for easy access in the next step. With this thorough preparation, we're now ready to build a multiple linear regression model using statsmodels that will reveal the key factors driving compensation in the AI job market and quantify their relative importance.

### Build Linear Regression Model

Now that we've prepared our data and addressed potential issues like multicollinearity and outliers, we're ready to build a multiple linear regression model to identify the key drivers of AI job salaries. This model will quantify the impact of various factors on compensation while controlling for other variables, allowing us to isolate the unique contribution of each predictor. Unlike our earlier bivariate analyses of skill premiums, regression provides a more comprehensive understanding by simultaneously considering all relevant factors. We'll start with a standard Ordinary Least Squares (OLS) model using statsmodels, which provides detailed statistical outputs including coefficient estimates, confidence intervals, p-values, and overall model fit statistics.

In [63]:
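# Build a linear regression model using statsmodels for detailed statistics
# We'll use the standardized features for better coefficient interpretability
X_train_sm = sm.add_constant(X_train_std)  # Add intercept term
# Note: np.asarray discards the DataFrame column names, so the summary below
# labels predictors generically (x1, x2, ...) in design-matrix column order
model = sm.OLS(np.asarray(y_train.astype(float)), np.asarray(X_train_sm.astype(float)))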
results = model.fit()

# Get model summary statistics
print("Model Summary:")
print(results.summary())

# Extract key model statistics
r_squared = results.rsquared
adj_r_squared = results.rsquared_adj
f_stat = results.fvalue
f_pvalue = results.f_pvalue
aic = results.aic
bic = results.bic

# Create a summary of model performance
model_stats = pd.DataFrame({
    'Metric': ['R-squared', 'Adjusted R-squared', 'F-statistic', 'F p-value', 'AIC', 'BIC'],
    'Value': [r_squared, adj_r_squared, f_stat, f_pvalue, aic, bic]
})

Model Summary:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.258
Model:                            OLS   Adj. R-squared:                  0.174
Method:                 Least Squares   F-statistic:                     3.063
Date:                Mon, 16 Jun 2025   Prob (F-statistic):           1.00e-11
Time:                        19:03:58   Log-Likelihood:                -7172.6
No. Observations:                 579   AIC:                         1.447e+04
Df Residuals:                     519   BIC:                         1.473e+04
Df Model:                          59                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.054e+05   1.68e+04     

Our multiple linear regression model explains approximately 25.8% of the variation in AI job salaries (R-squared = 0.258), with an adjusted R-squared of 0.174 that accounts for the number of predictors used. While this explanatory power is modest, the model is statistically significant overall (F-statistic p-value < 0.001), indicating that our predictors collectively have meaningful relationships with salary levels.

Several individual predictors show statistically significant effects (p < 0.05):
- Variable x10 has a positive coefficient (approximately +$16,200), meaning it is associated with a higher expected salary when controlling for other factors
- Variables x17, x18, x20, x21, x22, x25, x26, and x49 all have negative coefficients, indicating they're associated with lower expected salaries when controlling for other factors
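
The `xN` labels arise because the model was fitted on NumPy arrays, which carry no column names. Assuming statsmodels numbers the predictors in design-matrix order with the constant in position 0 (consistent with the `const` row in the summary), each label can be translated back by position. The helper below is a hypothetical sketch, and the short column list is a stand-in for `list(X_train_sm.columns)`.

```python
def placeholder_to_feature(design_columns, label):
    """Translate an auto-generated 'xN' label into the N-th design-matrix column.

    Assumes the constant sits at position 0, so 'xN' maps to index N.
    """
    if label == "const":
        return "const"
    position = int(label[1:])
    return design_columns[position]

# Illustrative stand-in; in the notebook this would be list(X_train_sm.columns).
design_columns = ["const", "education_level", "remote_ratio", "job_description_length"]

mapping = {
    label: placeholder_to_feature(design_columns, label)
    for label in ["x1", "x3"]
}
print(mapping)
# {'x1': 'education_level', 'x3': 'job_description_length'}
```

Alternatively, passing the pandas objects straight to `sm.OLS` instead of wrapping them in `np.asarray` should make the summary print the real feature names directly.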

The diagnostics give a mixed picture of the regression assumptions. A Durbin-Watson statistic near 2 (1.962) indicates no significant autocorrelation in the residuals, but this covers only one assumption. The Omnibus test (p-value = 0.014) rejects normality of the residuals at the 5% level, which warrants further investigation before relying on the reported p-values and confidence intervals.

The relatively low R-squared suggests that while we've captured some important salary drivers, there are likely additional factors or non-linear relationships not captured by our current model. We should consider examining the significant predictors more closely to understand what specific skills, experience levels, or industry factors are most strongly influencing AI job compensation.

## Next Steps and Key Compensation Drivers

Given the model's modest explanatory power (R-squared = 0.258), we should take several steps to improve our understanding of AI salary determinants:

1. **Examine Variable Importance**: Identify which specific factors the significant variables (x10, x17, x18, etc.) represent in our dataset. These numeric placeholders obscure the actual predictors driving compensation differences.

2. **Consider Non-Linear Relationships**: The current model assumes linear relationships between predictors and salary. We should explore polynomial terms or interaction effects, particularly between experience level and skills, as our earlier analysis showed skill premiums vary substantially across career stages.

3. **Feature Selection Refinement**: Implement a more systematic approach to feature selection, such as stepwise regression or LASSO, to identify the most relevant predictors and produce a more parsimonious model with fewer uninformative terms.

4. **Alternative Modeling Approaches**: Consider more flexible models like random forests or gradient boosting machines that can capture non-linear patterns and complex interactions.
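
As a starting point for the interaction effects mentioned in step 2, the sketch below builds skill-by-experience interaction columns that could then be added to the design matrix. It is an illustration under assumptions: the `add_interactions` helper, the `_x_` naming scheme, and the toy rows are all hypothetical, and which dummy columns actually exist depends on the encoding and VIF steps above.

```python
import pandas as pd

# Toy design matrix with two experience dummies and two skill flags (invented values).
toy = pd.DataFrame({
    "experience_level_SE": [0, 1, 1, 0, 1],
    "experience_level_MI": [0, 0, 0, 1, 0],
    "skill_AWS": [1, 1, 0, 1, 0],
    "skill_PyTorch": [0, 1, 1, 1, 0],
})

def add_interactions(df, level_cols, skill_cols):
    """Add one product column per (skill, experience level) pair.

    Each new column equals 1 only when the posting has both the skill
    and the experience level, so its coefficient captures how the skill
    premium shifts at that level relative to the baseline level.
    """
    out = df.copy()
    for level in level_cols:
        for skill in skill_cols:
            # Cast to int so boolean dummies from get_dummies multiply cleanly.
            out[f"{skill}_x_{level}"] = out[skill].astype(int) * out[level].astype(int)
    return out

with_inter = add_interactions(
    toy,
    level_cols=["experience_level_SE", "experience_level_MI"],
    skill_cols=["skill_AWS", "skill_PyTorch"],
)
print(with_inter.filter(like="_x_"))
```

Each interaction adds a parameter, so with 579 training rows it is probably wise to limit the pairs to skills whose premiums differed most across levels in the earlier analysis.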

Based on our previous analyses, several key factors appear to drive AI compensation:

1. **Experience Level**: Executive positions command substantial premiums, particularly when combined with certain technical skills.

2. **High-Value Skills**: Java, Docker, and PyTorch showed the highest overall premiums, with PyTorch being particularly valuable at the executive level.

3. **Skill Domain Breadth**: Jobs requiring skills across multiple domains or complete coverage of all domains showed higher compensation than those with moderate skill breadth.

4. **Industry Variation**: Finance, Gaming, and Retail industries offered higher median salaries than others, suggesting industry-specific factors significantly influence compensation.

By refining our modeling approach and focusing on these key drivers, we can develop a more comprehensive understanding of the factors that truly determine compensation in the AI job market, providing STEAMe's users with more actionable intelligence for career planning and skill development.