### *2025 AI Job Market Analysis: Salaries, Skills, and Hiring Trends*

#### 🎯 *Business Problem*
*In the dynamic and fast-growing field of artificial intelligence, both companies and professionals face significant challenges in understanding global hiring trends, salary expectations, and skill demands. As AI roles continue to diversify across industries and regions, it's crucial to identify how factors such as job title, experience level, company size, employment type, remote flexibility, required skills, and educational background influence compensation and hiring patterns.*

*This project aims to analyze a global dataset of AI-related job postings from 2025 to uncover key insights on:*

- *Competitive salary ranges across roles, locations, and experience levels*
- *The impact of remote work, required skills, and education on job offers*
- *Application timelines and company-level hiring behaviors*

*The ultimate goal is to help:*

- *Organizations benchmark salaries, structure job roles, and optimize talent acquisition strategies*
- *Job seekers identify high-growth opportunities, understand skill-based compensation, and align their career planning with market demands*

*This end-to-end data analysis project covers data cleaning, feature engineering, trend analysis, and insight generation — providing a real-world foundation for AI job market intelligence.*

#### 🔧 *Tools Used* 
*Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn*

#### *Size of Dataset 2608 KB*


#### *Author* - *Niranjan (Data Analyst)*

### *Environmental Setup*
#### *⚙️ Importing Essential Libraries*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

### *Data Loading*
#### *📂 Loading the Dataset*

In [None]:
df = pd.read_csv("ai_job_dataset.csv")

### *Data Understanding*
#### *🔍Initial Data Inspection*

In [None]:
df.shape

In [None]:
df.columns.tolist()

In [None]:
df.dtypes

In [None]:
df.head(3)

In [None]:
df.isna().sum()
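
The per-column checks that follow (data type, number of unique values, missing values) can also be collected in one table. The sketch below is one possible way to do that; `summarize_columns` is a hypothetical helper and the `sample` frame holds invented values, not rows from the real dataset.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct values, and missing values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(),      # NaN is not counted as a distinct value
        "n_missing": frame.isna().sum(),
    })

# Toy stand-in for the job postings table (values are invented).
sample = pd.DataFrame({
    "job_id": ["AI00001", "AI00002", "AI00003"],
    "salary_usd": [90000, 120000, 90000],
    "company_size": ["S", "M", None],
})

summary = summarize_columns(sample)
print(summary)
```

Calling `summarize_columns(df)` on the loaded dataset would give the same overview for every column at once.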

### *Data Exploration*
#### *🔎 Analyzing Patterns & Distributions*

#### 🆔 *Job ID*
- *Data type is object*
- *All values are unique*
- *The column can be dropped because an ID is not needed for the analysis*
- *No missing values present*

In [None]:
df['job_id'].dtype

In [None]:
df["job_id"].nunique()

In [None]:
df["job_id"].unique()

#### 🧠 *Job Title*
- *Data type is object*
- *Categorical column listing the job titles*
- *Total 20 unique values*
- *Titles repeat across postings*
- *Machine Learning Researcher is the most frequent title*
- *AI Software Engineer is the second most frequent*
- *No missing values present in the column*

In [None]:
df['job_title'].dtype

In [None]:
df["job_title"].nunique()

In [None]:
df["job_title"].unique()

In [None]:
df["job_title"].value_counts()

In [None]:
top_10_jobs = df['job_title'].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_jobs.values, y=top_10_jobs.index, palette='viridis')
plt.title('Top 10 Job Titles by Count')
plt.xlabel('Count')
plt.ylabel('Job Title')
plt.tight_layout()
plt.show()

#### 💵 *Salary usd*
- *Continuous column*
- *14315 unique values*
- *Some salary values repeat*
- *Right-skewed*
- *Outliers present*
- *No missing values present*

In [None]:
df['salary_usd'].dtype

In [None]:
df["salary_usd"].nunique()

In [None]:
df["salary_usd"].unique()

In [None]:
df["salary_usd"].describe()

In [None]:
df["salary_usd"].skew()

In [None]:
sns.histplot(data = df, x = "salary_usd", kde = True)

In [None]:
sns.boxplot(data = df, x = "salary_usd")
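
The "outliers present" and "right-skewed" observations can be backed with numbers. A minimal sketch, assuming the usual 1.5 × IQR boxplot rule is an acceptable outlier definition here; `iqr_outlier_mask` is a hypothetical helper and the salaries are invented.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Invented salaries with one extreme value on the high side.
salaries = pd.Series([50_000, 60_000, 70_000, 80_000, 90_000, 400_000])

mask = iqr_outlier_mask(salaries)
print("Outliers:", salaries[mask].tolist())
print("Skewness is positive (right tail):", salaries.skew() > 0)
```

On the real data, `iqr_outlier_mask(df["salary_usd"]).sum()` would count the flagged postings.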

#### 💵 *Salary currency*
- *Data type is object*
- *3 unique values*
- *Categorical Column*
- *USD is the most frequent*
- *GBP (British pound) is the least frequent*
- *No missing values present in the column*

In [None]:
df['salary_currency'].dtype

In [None]:
df['salary_currency'].nunique()

In [None]:
df['salary_currency'].unique()

In [None]:
df['salary_currency'].value_counts()

In [None]:
sns.countplot(x='salary_currency', data = df ,palette = "viridis")

#### 📈 *Experience level*
- *Data type is object*
- *4 distinct unique categories*
- *Mid-level (experienced but not senior) has the most records*
- *Entry level has the fewest*
- *No missing values present*

In [None]:
df['experience_level'].dtype

In [None]:
df['experience_level'].nunique()

In [None]:
df['experience_level'].unique()

In [None]:
df['experience_level'].value_counts()

In [None]:
sns.countplot(x='experience_level', data = df,palette = "viridis")

#### 👔 *Employment type*
- *Data type is object*
- *Categorical column*
- *4 unique categories*
- *Full-time employment is the most frequent*
- *Part-time employment is the least frequent*
- *No missing values present in the data set*

In [None]:
df['employment_type'].dtype

In [None]:
df['employment_type'].nunique()

In [None]:
df['employment_type'].unique()

In [None]:
df['employment_type'].value_counts()

In [None]:
sns.countplot(x='employment_type', data=df,palette = "viridis")

#### 🌍 *Company location*
- *Data type is object*
- *20 unique company locations*
- *Germany has the most postings*
- *Norway has the fewest*
- *No missing values present in the data set*

In [None]:
df['company_location'].dtype

In [None]:
df['company_location'].nunique()

In [None]:
df['company_location'].unique()

In [None]:
df['company_location'].value_counts()

In [None]:
top_10_locations = df['company_location'].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_locations.values, y=top_10_locations.index, palette='viridis')
plt.title('Top 10 Company Locations')
plt.xlabel('Count')
plt.ylabel('Company Location')
plt.tight_layout()
plt.show()

#### 🌍 *Company size*
- *Data type is object*
- *4 unique categories*
- *Categorical column*
- *Small companies are the most frequent*
- *No missing values present in the data set*

In [None]:
df['company_size'].dtype

In [None]:
df['company_size'].nunique()

In [None]:
df['company_size'].unique()

In [None]:
df['company_size'].value_counts()

In [None]:
sns.countplot(x='company_size', data=df,palette = "viridis")

#### 🧑‍💼 *Employee residence*
- *Data type is object*
- *20 unique employee residence locations*
- *Most employees are from Sweden*
- *Japan has the fewest*
- *No missing values present in the data set*

In [None]:
df['employee_residence'].dtype

In [None]:
df['employee_residence'].nunique()

In [None]:
df['employee_residence'].unique()

In [None]:
df['employee_residence'].value_counts()

#### 🌐 *Remote ratio*
- *Data type is int*
- *Categorical column*
- *3 categories present in the data set*
- *No missing values present in the data set*
- *The largest group is fully on-site (no remote work)*

In [None]:
df['remote_ratio'].dtype

In [None]:
df['remote_ratio'].nunique()

In [None]:
df['remote_ratio'].unique()

In [None]:
df['remote_ratio'].value_counts()

#### 🧑‍🔬 *Required skills*
- *Data type is object*
- *Example skills: Python, TensorFlow, PyTorch*
- *No missing values present in the data set*
- *13663 unique skill combinations in the data set*

In [None]:
df['required_skills'].dtype

In [None]:
df['required_skills'].nunique()

In [None]:
df['required_skills'].unique()

In [None]:
df['required_skills'].value_counts()

In [None]:
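# value_counts() on the raw column counts whole skill strings, i.e. skill combinations
top_10_skill_sets = df['required_skills'].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_skill_sets.values, y=top_10_skill_sets.index, palette='viridis')
plt.title('Top 10 Required Skill Combinations')
plt.xlabel('Count')
plt.ylabel('Skill Combination')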
plt.tight_layout()
plt.show()

#### 🎓 *Education Required*
- *Data Type is object*
- *4 unique categories*
- *Bachelor is the most frequent, with 3789 records*
- *PhD is the least frequent, with 3678 records*
- *No missing values present*

In [None]:
df['education_required'].dtype

In [None]:
df['education_required'].nunique()

In [None]:
df['education_required'].unique()

In [None]:
df['education_required'].value_counts()

#### 🎓 *Years experience*
- *Data type is int*
- *20 unique values*
- *Holds the years of experience*
- *Freshers are the largest group*
- *The second largest group is employees with 1 year of experience*
- *No missing values present in the column*
- *No outliers present*

In [None]:
df['years_experience'].dtype

In [None]:
df['years_experience'].nunique()

In [None]:
df['years_experience'].unique()

In [None]:
df['years_experience'].value_counts()

In [None]:
sns.boxplot(data = df, x = 'years_experience')

In [None]:
top_10_jobs = df['years_experience'].value_counts().nlargest(10)

plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_jobs.index, y=top_10_jobs.values, palette='viridis')
plt.title('Top 10 Years of Experience by Count')
plt.xlabel('Years of Experience')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

#### 🎓 *Industry*
- *Data type is object*
- *15 unique values*
- *Categorical column*
- *Retail is the most frequent industry*
- *Education is the least frequent industry*
- *No missing values present in the column*

In [None]:
df['industry'].dtype

In [None]:
df['industry'].nunique()

In [None]:
df['industry'].unique()

In [None]:
df['industry'].value_counts()

#### 📅 *Posting date*
- *Time series column*
- *Data type is object*
- *Posting dates range from '2024-01-01' to '2025-04-30'*
- *The most frequent posting date is 2024-07-05*

In [None]:
df['posting_date'].dtype

In [None]:
df['posting_date'].min(),df['posting_date'].max()

In [None]:
df['posting_date'].value_counts().sort_index()

#### 📅 *Application Deadline*
- *Data type is object*
- *534 unique values*
- *Application deadlines range from '2024-01-16' to '2025-07-11'*
- *No missing values in the application deadline column*
- *The most frequent application deadline is 2025-01-05*

In [None]:
df['application_deadline'].dtype

In [None]:
df['application_deadline'].min(),df['application_deadline'].max()

In [None]:
df['application_deadline'].value_counts().sort_index()

#### 📄 *Job description length*
- *Data type is int64*
- *Total unique values is 2000*
- *highest job description is 1076 characters and the lowest is with 14999 characters with the numbers as 2492*
- *No missing or null values*

In [None]:
df['job_description_length'].dtype

In [None]:
df['job_description_length'].nunique()

In [None]:
df['job_description_length'].unique()

In [None]:
df['job_description_length'].head(5)

#### 🎁 *Benefits score*
- *Data type is float*
- *Total number of unique values is 51*
- *Continuous column*
- *Most benefit score is 9.9*
- *Lowest benefit score is 5.0*
- *No missing values in the benefit score column*

In [None]:
df['benefits_score'].dtype

In [None]:
df['benefits_score'].nunique()

In [None]:
df['benefits_score'].unique()

In [None]:
df['benefits_score'].value_counts().sort_index()

In [None]:
sns.boxplot(data = df, x = 'benefits_score')

#### 🏢 *Company name*
- *Data Type is object*
- *51 unique values*
- *Most of the records are from TechCorp Inc*
- *Algorithmic Solutions have the lowest numbers of records*

In [None]:
df['company_name'].dtype

In [None]:
df['company_name'].nunique()

In [None]:
df['company_name'].unique()

In [None]:
df['company_name'].value_counts()

### *Data Cleaning*
#### *🔧 Preprocessing & Imputation*

#### *Dropping columns that are not necessary for the data analysis as per the business problem*

In [None]:
# List of columns to drop
cols_to_drop = [
    'job_id',
    'salary_currency',
    'job_description_length'
]

# Drop the columns
df.drop(columns=cols_to_drop, inplace=True)

#### *Categorical Variable Mapping for Clarity*

In [None]:
exp_map = {
    'EN': 'Entry-Level',
    'MI': 'Mid-Level',
    'SE': 'Senior-Level',
    'EX': 'Executive'
}

df['experience_level'] = df['experience_level'].map(exp_map)

In [None]:
df['experience_level'].unique()

In [None]:
emp_map = {
    'FT': 'Full-Time',
    'PT': 'Part-Time',
    'CT': 'Contract',
    'FL': 'Freelance'
}

df['employment_type'] = df['employment_type'].map(emp_map)

In [None]:
df['employment_type'].unique()

In [None]:
size_map = {
    'S': 'Small',
    'M': 'Medium',
    'L': 'Large'
}

df['company_size'] = df['company_size'].map(size_map)

In [None]:
df['company_size'].unique()

In [None]:
remote_map = {
    0: 'On-site',
    50: 'Hybrid',
    100: 'Fully Remote'
}

df['remote_ratio'] = df['remote_ratio'].map(remote_map)

In [None]:
df['remote_ratio'].unique()
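
`Series.map` with a dictionary turns every value that has no key in the dictionary into `NaN`, so a code missing from one of the maps would silently disappear. A small sketch of a check that can be run before overwriting a column; the `"XX"` code is invented for illustration.

```python
import pandas as pd

codes = pd.Series(["EN", "MI", "SE", "EX", "XX"])  # "XX" is an invented, unmapped code
exp_map = {"EN": "Entry-Level", "MI": "Mid-Level", "SE": "Senior-Level", "EX": "Executive"}

mapped = codes.map(exp_map)

# Codes that were present before mapping but became NaN afterwards.
unmapped = codes[mapped.isna() & codes.notna()].unique().tolist()
print("Unmapped codes:", unmapped)
```

An empty list means every code was covered by the mapping.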

#### *Extracting Month from posting date*

In [None]:
# The dates seen during inspection are in YYYY-MM-DD form (e.g. '2024-01-01')
df['posting_date'] = pd.to_datetime(df['posting_date'], format='%Y-%m-%d')
df['posting_month'] = df['posting_date'].dt.strftime('%B')

In [None]:
df['posting_month'].unique()

In [None]:
df['posting_month'].value_counts()

#### *Extracting Month from Application deadline*

In [None]:
# The dates seen during inspection are in YYYY-MM-DD form (e.g. '2024-01-16')
df['application_deadline'] = pd.to_datetime(df['application_deadline'], format='%Y-%m-%d')
df['application_month'] = df['application_deadline'].dt.strftime('%B')

In [None]:
df['application_month'].unique()

In [None]:
df['application_month'].value_counts()

#### *Feature Engineering*

In [None]:
df['application_duration'] = (df['application_deadline'] - df['posting_date']).dt.days

In [None]:
df['application_duration'].unique()

In [None]:
df['application_duration'].describe()

In [None]:
bins = [0, 20, 40, 60, 100]
labels = ['Short', 'Medium', 'Long', 'Very Long']
df['application_duration_category'] = pd.cut(df['application_duration'], bins=bins, labels=labels)

In [None]:
df['application_duration_category'].unique()
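
`pd.cut` builds right-closed intervals by default, so with `bins = [0, 20, 40, 60, 100]` the first interval is (0, 20]. A duration of exactly 0 days, or anything above 100, falls outside every bin and becomes `NaN`. The sketch below uses invented durations to show the effect and the `include_lowest=True` option.

```python
import pandas as pd

durations = pd.Series([0, 14, 20, 21, 60, 100, 120])  # invented values, in days
bins = [0, 20, 40, 60, 100]
labels = ["Short", "Medium", "Long", "Very Long"]

default_cut = pd.cut(durations, bins=bins, labels=labels)
inclusive_cut = pd.cut(durations, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "days": durations,
    "default": default_cut,
    "include_lowest": inclusive_cut,
}))
```

Checking `df['application_duration_category'].isna().sum()` after binning shows whether any real posting fell outside the bins.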

#### *Renaming the columns*

In [None]:
print(df.columns)

In [None]:
# Example: rename specific columns
df.rename(columns={
    'job_title': 'Job_Title',
    'salary_usd': 'Salary_USD',
    'experience_level': 'Experience_Level',
    'employment_type': 'Employment_Type',
    'company_location': 'Company_Location',
    'company_size' : 'Company_Size',
    'employee_residence': 'Employee_Residence',
    'remote_ratio': 'Remote_Ratio',
    'required_skills': 'Required_Skills',
    'education_required': 'Education_Required',
    'years_experience' : 'Years_Experience',
    'industry': 'Industry',
    'posting_date': 'Posting_Date',
    'application_deadline' : 'Application_Deadline',
    'benefits_score' : 'Benefits_Score',
    'company_name':'Company_Name',
    'posting_month': 'Posting_Month',
    'application_month': 'Application_Month',
    'application_duration': 'Application_Duration',
    'application_duration_category': 'Application_Duration_Category'
}, inplace=True)

In [None]:
df.head(3)

#### *Check how many duplicate rows exist*

In [None]:
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

In [None]:
print(df.columns)

#### *Categorizing the columns on the basis of their types*

In [None]:
continuous = ['Salary_USD', 'Years_Experience', 'Benefits_Score', 'Application_Duration']

categorical = [
    'Job_Title', 'Experience_Level', 'Employment_Type', 'Company_Size',
    'Education_Required', 'Industry', 'Company_Name', 'Application_Duration_Category'
]

count = ['Company_Location', 'Employee_Residence']

time = ['Posting_Date', 'Application_Deadline', 'Posting_Month', 'Application_Month']

text = ['Required_Skills']  # Multi-label, handle separately

special_case = ['Remote_Ratio']  # Limited discrete values (0, 50, 100)

### *Data Analysis*
#### *Business Insights Extraction*

#### *Univariate Analysis - Continuous Variables*

In [None]:
continuous = ['Salary_USD', 'Years_Experience', 'Benefits_Score', 'Application_Duration']

#### 🔍 *1. What is the overall distribution of salaries in AI job roles across 2025? Are there any noticeable outliers or skewness?*

In [None]:
plt.figure(figsize=(14,5))
sns.histplot(df['Salary_USD'], bins=50, kde=True, color='skyblue')
plt.title('Distribution of AI Salaries (USD)')
plt.xlabel('Salary in USD')
plt.ylabel('Frequency')
plt.show()

print(df['Salary_USD'].describe())

#### 🔍 *2. What is the typical experience required for AI roles? Are most jobs for juniors, mid-levels, or seniors?*

In [None]:
plt.figure(figsize=(14,5))
sns.histplot(df['Years_Experience'], bins=20, kde=True, color='orange')
plt.title('Distribution of Required Years of Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Frequency')
plt.show()

print(df['Years_Experience'].describe())

#### 🔍 *3. How are companies rating their job benefits for AI roles? Is there a consistent scoring trend?*

In [None]:
plt.figure(figsize=(14,5))
sns.histplot(df['Benefits_Score'], bins=25, kde=True, color='green')
plt.title('Distribution of Benefits Score')
plt.xlabel('Benefits Score')
plt.ylabel('Frequency')
plt.show()

print(df['Benefits_Score'].describe())

#### 🔍 *4. How much time do companies typically allow for job applications? Are deadlines tight or relaxed?*

In [None]:
plt.figure(figsize=(14,5))
sns.histplot(df['Application_Duration'], bins=30, kde=True, color='purple')
plt.title('Distribution of Application Duration (days)')
plt.xlabel('Duration in Days')
plt.ylabel('Frequency')
plt.show()

print(df['Application_Duration'].describe())

#### *Univariate Analysis - Categorical Variables*

In [None]:
categorical = [
    'Job_Title', 'Experience_Level', 'Employment_Type', 'Company_Size',
    'Education_Required', 'Industry', 'Company_Name', 'Application_Duration_Category'
]

#### 🔍 *5.What are the most common job titles in AI roles?*

In [None]:
df['Job_Title'].value_counts().head(10).plot(kind='barh', figsize=(10,5), color='skyblue')
plt.title('Top 10 Most Common Job Titles')
plt.xlabel('Number of Job Postings')
plt.gca().invert_yaxis()
plt.show()

#### 🔍 *6. How is experience level distributed?*

In [None]:
df['Experience_Level'].value_counts().plot(kind='bar', color='coral')
plt.title('Distribution of Experience Levels')
plt.ylabel('Count')
plt.xlabel('Experience Level')
plt.show()

#### 🔍 *7. What are the dominant employment types in the AI market?*

In [None]:
df['Employment_Type'].value_counts().plot(kind='pie', autopct='%1.1f%%', startangle=90, colors=['#ff9999','#66b3ff','#99ff99','#ffcc99'])
plt.title('Employment Type Distribution')
plt.ylabel('')
plt.show()

#### 🔍 *8. How is company size represented?*

In [None]:
sns.countplot(data=df, x='Company_Size', order=df['Company_Size'].value_counts().index, palette='Set2')
plt.title('Company Size Distribution')
plt.xlabel('Company Size')
plt.ylabel('Count')
plt.show()

#### 🔍 *9. What education levels are most demanded?*

In [None]:
df['Education_Required'].value_counts().plot(kind='bar', color='mediumpurple')
plt.title('Education Requirement Distribution')
plt.xlabel('Education Level')
plt.ylabel('Number of Jobs')
plt.show()

#### 🔍 *10. Which industries are hiring the most for AI roles?*

In [None]:
df['Industry'].value_counts().head(10).plot(kind='barh', figsize=(8,5), color='orange')
plt.title('Top 10 Hiring Industries')
plt.xlabel('Job Postings')
plt.gca().invert_yaxis()
plt.show()

#### 🔍*11. Which companies are leading in AI job postings?*

In [None]:
df['Company_Name'].value_counts().head(10).plot(kind='barh', figsize=(8,5), color='teal')
plt.title('Top 10 Companies Hiring for AI Roles')
plt.xlabel('Job Postings')
plt.gca().invert_yaxis()
plt.show()

#### 🔍 *12. How long are companies keeping applications open?*

In [None]:
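# Use the category order from pd.cut (Short -> Very Long); sorted() would order the labels alphabetically
sns.countplot(data=df, x='Application_Duration_Category', palette='coolwarm', order=df['Application_Duration_Category'].cat.categories)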
plt.title('Application Duration Category Distribution')
plt.xlabel('Application Duration (days)')
plt.ylabel('Count')
plt.show()

#### *Univariate Analysis - Count Columns*

In [None]:
count = ['Company_Location', 'Employee_Residence']

#### 🔍 *13. Which countries have the most AI job postings?*

In [None]:
df['Company_Location'].value_counts().head(10).plot(kind='barh', figsize=(8,5), color='lightskyblue')
plt.title('Top 10 Company Locations for AI Jobs')
plt.xlabel('Number of Postings')
plt.gca().invert_yaxis()
plt.show()

#### 🔍 *14. Where do most employees reside?*

In [None]:
df['Employee_Residence'].value_counts().head(10).plot(kind='barh', figsize=(8,5), color='lightgreen')
plt.title('Top 10 Employee Residences')
plt.xlabel('Number of Employees')
plt.gca().invert_yaxis()
plt.show()

#### *Univariate Analysis - Time Columns*

In [None]:
time = ['Posting_Date', 'Application_Deadline', 'Posting_Month', 'Application_Month']

#### 🔍 *15. How has AI job posting activity varied over time?*

In [None]:
df['Posting_Date'] = pd.to_datetime(df['Posting_Date'])
posting_trend = df['Posting_Date'].value_counts().sort_index()
posting_trend.plot(figsize=(10,4), title='Job Postings Over Time', color='teal')
plt.ylabel('Number of Postings')
plt.xlabel('Date')
plt.show()

#### 🔍*16. Do deadlines follow any time patterns?*

In [None]:
df['Application_Deadline'] = pd.to_datetime(df['Application_Deadline'])
deadline_trend = df['Application_Deadline'].value_counts().sort_index()
deadline_trend.plot(figsize=(10,4), title='Application Deadlines Over Time', color='orange')
plt.ylabel('Number of Deadlines')
plt.xlabel('Date')
plt.show()

#### 🔍 *17. Which months saw the most job postings?*

In [None]:
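# Order the bars by calendar month; sort_index() would sort the month names alphabetically
month_order = pd.date_range('2024-01-01', periods=12, freq='MS').strftime('%B')
df['Posting_Month'].value_counts().reindex(month_order).plot(kind='bar', color='dodgerblue')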
plt.title('Job Postings by Month')
plt.xlabel('Month')
plt.ylabel('Number of Postings')
plt.show()

#### 🔍*18. Which months had the most application deadlines?*

In [None]:
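# Order the bars by calendar month; sort_index() would sort the month names alphabetically
month_order = pd.date_range('2024-01-01', periods=12, freq='MS').strftime('%B')
df['Application_Month'].value_counts().reindex(month_order).plot(kind='bar', color='salmon')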
plt.title('Application Deadlines by Month')
plt.xlabel('Month')
plt.ylabel('Number of Deadlines')
plt.show()

#### *Univariate Analysis - special case Columns*

In [None]:
# Share of postings in each remote-work category (On-site, Hybrid, Fully Remote)
df['Remote_Ratio'].value_counts(normalize=True).plot(kind='bar', color='mediumseagreen')
plt.title('Distribution of Remote Work Categories')
plt.xlabel('Remote Work Category')
plt.ylabel('Proportion of Jobs')
plt.show()

#### *Univariate Analysis - Text Columns*

In [None]:
# Assuming skills are comma-separated strings
skills_series = df['Required_Skills'].dropna().str.split(',')

all_skills = [skill.strip() for sublist in skills_series for skill in sublist]

skill_counts = Counter(all_skills)

# Show top 15 skills
top_skills = skill_counts.most_common(15)

skills_df = pd.DataFrame(top_skills, columns=['Skill', 'Count'])

skills_df.plot.bar(x='Skill', y='Count', legend=False, figsize=(12,5), color='coral')
plt.title('Top 15 Required Skills in AI Job Postings')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

#### *Bivariate Analysis*

#### 🔍 *1. How does salary vary by experience level?*

In [None]:
# Group average salary by experience level
avg_salary_exp = df.groupby('Experience_Level')['Salary_USD'].mean().sort_values()

avg_salary_exp.plot(kind='bar', color='lightgreen', figsize=(8, 5))
plt.title('Average Salary by Experience Level')
plt.xlabel('Experience Level')
plt.ylabel('Average Salary (USD)')
plt.xticks(rotation=0)
plt.show()
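
Because the salary column was described as right-skewed, the mean for each group can be pulled upward by a few large values. A minimal sketch on invented numbers that reports mean, median, and group size side by side:

```python
import pandas as pd

# Invented salaries; one Entry-Level value is deliberately extreme.
toy = pd.DataFrame({
    "Experience_Level": ["Entry-Level", "Entry-Level", "Entry-Level",
                         "Senior-Level", "Senior-Level"],
    "Salary_USD": [50_000, 55_000, 150_000, 120_000, 130_000],
})

stats = (
    toy.groupby("Experience_Level")["Salary_USD"]
    .agg(["mean", "median", "count"])
    .sort_values("median")
)
print(stats)
```

Replacing `toy` with `df` gives the same comparison for the real postings; a large gap between mean and median points to groups dominated by a few high salaries.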

#### 🔍 *2. How does company size affect salary?*

In [None]:
# Median salary grouped by company size
median_salary_company = df.groupby('Company_Size')['Salary_USD'].median().sort_values()

median_salary_company.plot(kind='bar', color='steelblue', figsize=(8, 5))
plt.title('Median Salary by Company Size')
plt.xlabel('Company Size')
plt.ylabel('Median Salary (USD)')
plt.show()

#### 🔍 *3. How does remote flexibility (Remote Ratio) impact salary?*

In [None]:
df.groupby('Remote_Ratio')['Salary_USD'].mean().plot(kind='bar', color='coral', figsize=(8, 5))
plt.title('Average Salary by Remote Ratio')
plt.xlabel('Remote Work Category')
plt.ylabel('Average Salary (USD)')
plt.show()

#### 🔍 *4. Which employment types offer the highest salaries?*

In [None]:
df.groupby('Employment_Type')['Salary_USD'].mean().sort_values().plot(kind='bar', color='orchid', figsize=(8, 5))
plt.title('Average Salary by Employment Type')
plt.xlabel('Employment Type')
plt.ylabel('Average Salary (USD)')
plt.show()

#### 🔍 *5. What’s the average salary for different education requirements?*

In [None]:
df.groupby('Education_Required')['Salary_USD'].mean().sort_values().plot(kind='bar', color='goldenrod', figsize=(9, 5))
plt.title('Average Salary by Education Level')
plt.xlabel('Education Level')
plt.ylabel('Average Salary (USD)')
plt.xticks(rotation=45)
plt.show()

#### 🔍 *6. Do different industries pay significantly different salaries?*

In [None]:
top_industries = df['Industry'].value_counts().head(10).index
filtered = df[df['Industry'].isin(top_industries)]

filtered.groupby('Industry')['Salary_USD'].mean().sort_values().plot(kind='bar', color='mediumseagreen', figsize=(10, 5))
plt.title('Average Salary in Top Industries')
plt.xlabel('Industry')
plt.ylabel('Average Salary (USD)')
plt.xticks(rotation=45)
plt.show()

#### 🔍 *7. What’s the relationship between benefits score and salary?*

In [None]:
sns.scatterplot(data=df, x='Benefits_Score', y='Salary_USD', color='teal')
plt.title('Salary vs Benefits Score')
plt.xlabel('Benefits Score')
plt.ylabel('Salary (USD)')
plt.show()

#### 🔍 *8. Is salary influenced by years of experience?*

In [None]:
sns.scatterplot(data=df, x='Years_Experience', y='Salary_USD', color='tomato')
plt.title('Salary vs Years of Experience')
plt.xlabel('Years of Experience')
plt.ylabel('Salary (USD)')
plt.show()
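
A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on its strength. The sketch below computes the Pearson correlation (the pandas default) on invented values, as one way to complement the two scatter plots above.

```python
import pandas as pd

# Invented values that rise together.
toy = pd.DataFrame({
    "Years_Experience": [0, 2, 5, 8, 12],
    "Salary_USD": [60_000, 75_000, 100_000, 130_000, 170_000],
})

# Pearson correlation between the two numeric columns.
r = toy["Years_Experience"].corr(toy["Salary_USD"])
print(f"Pearson r = {r:.3f}")
```

On the real data, `df[['Years_Experience', 'Benefits_Score', 'Salary_USD']].corr()` would return the full correlation matrix.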

#### 🔍 *9. How does remote ratio relate to employment type?*

In [None]:
ct = pd.crosstab(df['Remote_Ratio'], df['Employment_Type'], normalize='index')
ct.plot(kind='bar', stacked=True, colormap='Set2', figsize=(9, 5))
plt.title('Remote Ratio vs Employment Type')
plt.xlabel('Remote Work Category')
plt.ylabel('Proportion')
plt.legend(title='Employment Type')
plt.show()
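
With `normalize='index'`, `pd.crosstab` divides each row by its row total, so every stacked bar sums to 1 and the plot compares shares rather than raw counts. A small sketch on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Remote_Ratio": ["On-site", "On-site", "Hybrid", "Hybrid", "Hybrid", "Fully Remote"],
    "Employment_Type": ["Full-Time", "Contract", "Full-Time", "Full-Time", "Part-Time", "Freelance"],
})

counts = pd.crosstab(toy["Remote_Ratio"], toy["Employment_Type"])
shares = pd.crosstab(toy["Remote_Ratio"], toy["Employment_Type"], normalize="index")

print(counts)
print(shares.round(2))
print(shares.sum(axis=1))  # each row of shares adds up to 1
```

Row shares hide group size, so it helps to look at `counts` as well when a category has few postings.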

#### 🔍 *10. How are experience levels distributed across company sizes?*

In [None]:
ct = pd.crosstab(df['Company_Size'], df['Experience_Level'], normalize='index')
ct.plot(kind='bar', stacked=True, colormap='Pastel1', figsize=(9, 5))
plt.title('Company Size vs Experience Level')
plt.xlabel('Company Size')
plt.ylabel('Proportion')
plt.legend(title='Experience Level')
plt.show()

#### 🔍 *11. What are the most frequently required skills for AI jobs?*

In [None]:
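from sklearn.feature_extraction.text import CountVectorizer

skills = df['Required_Skills'].dropna()

# Treat each comma-separated entry as one token; the default tokenizer would split
# multi-word skill names and drop single-character ones
vectorizer = CountVectorizer(
    tokenizer=lambda text: [s.strip() for s in text.split(',') if s.strip()],
    lowercase=False,
    token_pattern=None
)
X = vectorizer.fit_transform(skills)

skill_counts = pd.Series(np.asarray(X.sum(axis=0)).ravel(), index=vectorizer.get_feature_names_out()).sort_values(ascending=False)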
skill_counts.head(20).plot(kind='bar', color='slateblue', figsize=(10, 5))
plt.title('Top 20 Required Skills')
plt.xlabel('Skill')
plt.ylabel('Frequency')
plt.show()

#### *Multivariate Analysis*

#### 🔍 *1. How do experience level, company size, and remote ratio together influence average salary?*

In [None]:
sns.set_style("whitegrid")
# FacetGrid creates its own figure, so a separate plt.figure() call would only add an empty plot

# Create pivot table with all three factors
pivot = df.pivot_table(
    values='Salary_USD',
    index=['Experience_Level', 'Company_Size'],
    columns='Remote_Ratio',
    aggfunc='mean'
).stack().reset_index(name='Avg_Salary')

# Create a faceted bar plot
g = sns.FacetGrid(
    pivot, 
    col='Company_Size',
    col_wrap=3,
    height=4,
    aspect=1.2,
    sharey=True
)
g.map_dataframe(
    sns.barplot,
    x='Experience_Level',
    y='Avg_Salary',
    hue='Remote_Ratio',
    palette='viridis',
    errorbar=None
)

# Customize the plot
g.set_titles("Company Size: {col_name}")
g.set_axis_labels("Experience Level", "Average Salary (USD)")
g.figure.suptitle(
    'Salary by Experience Level, Company Size & Remote Ratio',
    y=1.05,
    fontsize=16,
    fontweight='bold'
)

# Adjust legend (without value annotations)
g.add_legend(
    title='Remote Ratio',
    bbox_to_anchor=(1.05, 0.5),
    frameon=True
)

plt.tight_layout()
plt.show()

#### 🔍 *2. Which combinations of job title and education level lead to higher salaries? (Top roles only)*

In [None]:
top_titles = df['Job_Title'].value_counts().head(10).index
filtered = df[df['Job_Title'].isin(top_titles)]

grouped = filtered.groupby(['Job_Title', 'Education_Required'])['Salary_USD'].mean().unstack()
grouped.plot(kind='bar', figsize=(12,6), colormap='Set3')
plt.title('Avg Salary by Job Title and Education Level')
plt.xlabel('Job Title')
plt.ylabel('Average Salary (USD)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

#### 🔍 *3. How does company size, employment type, and remote ratio interact?*

In [None]:
ct = pd.crosstab(
    [df['Company_Size'], df['Employment_Type']],
    df['Remote_Ratio'],
    normalize='index'
)

ct.plot(kind='bar', stacked=True, colormap='coolwarm', figsize=(12,6))
plt.title('Remote Ratio by Company Size and Employment Type')
plt.xlabel('Company Size + Employment Type')
plt.ylabel('Proportion')
plt.tight_layout()
plt.show()

#### 🔍 *4. How do required skills and experience level together influence average salary?*

In [None]:
# Step 1: Split each skill string into individual skills and count their frequencies
# "and" is matched only as a separate word, so names that contain it (e.g. "pandas") are not cut apart
skill_lists = df['Required_Skills'].fillna('').str.lower().str.split(r'\s*[,;&]\s*|\s+and\s+')
all_skills = skill_lists.explode().str.strip()
skill_counts = all_skills[all_skills != ''].value_counts()

# Get top 10 skills
top_skills = skill_counts.head(10).index.tolist()

# Step 2: Add binary columns for each top skill
# Exact matching on the split items avoids substring hits (e.g. "java" inside "javascript")
for skill in top_skills:
    df[skill] = skill_lists.apply(lambda items: skill in [s.strip() for s in items]).astype(int)

# Step 3: Group by experience level and calculate average salary for each skill
plt.figure(figsize=(14, 6))
level_order = ['Entry-Level', 'Mid-Level', 'Senior-Level', 'Executive']  # seniority order, as in exp_map

for skill in top_skills:
    # Calculate average salary by experience level for jobs requiring this skill
    avg_salary = df[df[skill] == 1].groupby('Experience_Level')['Salary_USD'].mean().reindex(level_order)
    plt.plot(avg_salary.index, avg_salary.values, label=skill, marker='o')

plt.title('Average Salary by Top Skills and Experience Level')
plt.xlabel('Experience Level')
plt.ylabel('Average Salary (USD)')
plt.legend(title='Skill', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()
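
When rows are flagged by skill, a substring search can over-count: a short name may appear inside a longer one. The sketch below contrasts substring matching with exact matching on the comma-separated items. The skill strings are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

skills = pd.Series(["Python, Java", "JavaScript, SQL", "R, Python", "Spark"])

# Substring search: "java" also matches "javascript".
substring_hit = skills.str.lower().str.contains("java", regex=False)

# Exact match on the individual comma-separated items.
tokens = skills.str.lower().str.split(",").apply(lambda items: [s.strip() for s in items])
exact_hit = tokens.apply(lambda items: "java" in items)

print(substring_hit.tolist())
print(exact_hit.tolist())
```

Passing `regex=False` also keeps names with special characters from being read as regular expressions.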

#### 🔍 *5. Which industries offer the best salary for each employment type?*

In [None]:
# Pivot table: Avg salary per industry for each employment type
pivot = df.pivot_table(
    values='Salary_USD',
    index='Industry',
    columns='Employment_Type',
    aggfunc='mean'
)

# Find the top-paying industry for each employment type
best_industries = pivot.idxmax().reset_index()
best_industries.columns = ['Employment_Type', 'Best_Paying_Industry']

# Get their corresponding salaries
best_salaries = pivot.max().reset_index()
best_salaries.columns = ['Employment_Type', 'Highest_Salary_USD']

# Combine results
result = pd.merge(best_industries, best_salaries, on='Employment_Type')

print("Top-paying industries for each employment type:")
print(result)

# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(
    data=result,
    x='Employment_Type',
    y='Highest_Salary_USD',
    hue='Best_Paying_Industry',
    palette='viridis'
)
plt.title('Highest Paying Industries by Employment Type')
plt.xlabel('Employment Type')
plt.ylabel('Average Salary (USD)')
plt.xticks(rotation=45)
plt.legend(title='Industry', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

#### 🔍 *6. Do posting/application months affect salary across job roles?*

In [None]:
# Group data by month and job title, then calculate mean salary
monthly_salary = df.groupby(['Posting_Month', 'Job_Title'])['Salary_USD'].mean().unstack().fillna(0)

# Select the 5 job titles with the highest average salary across months
top_jobs = monthly_salary.mean().nlargest(5).index
monthly_salary = monthly_salary[top_jobs]

# --- Data Visualization ---
plt.figure(figsize=(14, 7))
sns.set_style("whitegrid")  # Clean background with grid
palette = sns.color_palette("husl", n_colors=len(top_jobs))  # Vibrant color palette

# Line Plot (Trend Analysis)
sns.lineplot(
    data=monthly_salary.reset_index().melt(id_vars='Posting_Month'),
    x='Posting_Month',
    y='value',
    hue='Job_Title',
    palette=palette,
    linewidth=2.5,
    marker='o',  # Add markers for each data point
    markersize=8
)

# Enhancements
plt.title('Monthly Trends in Average Salary for Top 5 Job Titles', fontsize=16, pad=20)
plt.xlabel('Posting Month', fontsize=12)
plt.ylabel('Average Salary (USD)', fontsize=12)
plt.xticks(rotation=45)  # Rotate x-labels for readability
plt.legend(title='Job Title', bbox_to_anchor=(1.05, 1), loc='upper left')  # Move legend outside

# Add data labels for the highest and lowest points
for job in top_jobs:
    max_salary = monthly_salary[job].max()
    min_salary = monthly_salary[job].min()
    max_month = monthly_salary[job].idxmax()
    min_month = monthly_salary[job].idxmin()
    
    plt.annotate(f"${max_salary:,.0f}", 
                 (max_month, max_salary),
                 textcoords="offset points",
                 xytext=(0,10),
                 ha='center',
                 fontsize=9)
    
    plt.annotate(f"${min_salary:,.0f}", 
                 (min_month, min_salary),
                 textcoords="offset points",
                 xytext=(0,-15),
                 ha='center',
                 fontsize=9)

# Add a footnote for context
plt.figtext(0.5, -0.1, 
            "Note: Salaries are averaged across job postings per month.", 
            ha="center",
            fontsize=10,
            color='gray')

plt.tight_layout()
plt.show()

#### 🔍 *7. How do application duration categories vary by experience level and education?*

In [None]:
ct = pd.crosstab(
    [df['Experience_Level'], df['Education_Required']],
    df['Application_Duration_Category'],
    normalize='index'
)

ct.plot(kind='bar', stacked=True, colormap='Paired', figsize=(14,6))
plt.title('Application Duration Category by Experience and Education')
plt.xlabel('Experience + Education')
plt.ylabel('Proportion')
plt.tight_layout()
plt.show()