---
---
# **From Classroom to Career: Navigating the Data & AI Job Market**

**Capstone Project - Data Visualization and Visual Analytics**
---
---
---



 - This notebook analyzes AI job salaries across time, geography, experience level, and company size. It aims to give fresh graduates, professionals, and recruiters transparent insights drawn from the data.

 - Source: [Dataset](https://aijobs.net/salaries/download/) &
  [Kaggle Reference](https://www.kaggle.com/datasets/adilshamim8/salaries-for-data-science-jobs/data)



# **Dataset Explanation**

| **Feature**              | **Range**                                                                                              | **Explanation**                                                                 |
|--------------------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| `work_year`              | {2020, 2021, 2022, 2023, 2024, 2025}                                                                          | The year when the salary was paid.                                       |
| `experience_level`       | {EN, MI, SE, EX}                                                                                       | Level of professional experience: EN = Entry, MI = Mid-level, SE = Senior, EX = Executive. |
| `employment_type`        | {PT, FT, CT, FL}                                                                                       | Type of employment: PT = Part-time, FT = Full-time, CT = Contract, FL = Freelance. |
| `job_title`              | ℂ (e.g., Data Scientist, ML Engineer, AI Researcher)                                                   | Job title of the employee.                                                     |
| `salary`                 | ℝ (e.g., 10,000–1,000,000)                                                                             | Annual gross salary in original currency.                                      |
| `salary_currency`        | ℂ (e.g., USD, EUR, INR)                                                                               | Currency in which the salary was paid.                       |
| `salary_in_usd`          | ℝ (e.g., 10,000–1,000,000)                                                                             | Annual gross salary converted to USD.         |
| `employee_residence`     | ℂ (e.g., US, DE, IN)                                                                                   | Country of residence of the employee.                  |
| `remote_ratio`           | {0, 50, 100}                                                                                           | Percentage of remote work: 0 = on-site, 50 = hybrid, 100 = fully remote.       |
| `company_location`       | ℂ (e.g., US, DE, IN)                                                                                   | Country where the employing company is based.          |
| `company_size`           | {S, M, L}                                                                                              | Company size: S = Small (<50), M = Medium (50–250), L = Large (>250).          |
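
The ranges in the table can be checked against a loaded frame before any mapping is applied. The following is a minimal sketch, not part of the original notebook: the helper `unexpected_values` and the three-row `sample` frame are hypothetical, and only the allowed codes are taken from the table above.

```python
import pandas as pd

# Allowed codes, copied from the feature table above.
ALLOWED = {
    "experience_level": {"EN", "MI", "SE", "EX"},
    "employment_type": {"PT", "FT", "CT", "FL"},
    "remote_ratio": {0, 50, 100},
    "company_size": {"S", "M", "L"},
}


def unexpected_values(frame: pd.DataFrame) -> dict:
    """Return, per column, any values that fall outside the documented range."""
    problems = {}
    for column, allowed in ALLOWED.items():
        extra = set(frame[column].dropna().unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


# Hypothetical sample; the last row carries an undocumented size code.
sample = pd.DataFrame({
    "experience_level": ["SE", "MI", "EN"],
    "employment_type": ["FT", "FT", "CT"],
    "remote_ratio": [0, 100, 50],
    "company_size": ["M", "L", "XL"],
})

report = unexpected_values(sample)
print(report)
```

In the notebook, passing `df` right after `pd.read_csv` would flag any code that a later `.map(...)` call would silently turn into `NaN`.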


# Imports and Settings

In [1]:
#Importing all the required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.filterwarnings("ignore")

## The Raw Data

In [2]:
# Loading the dataset.
df = pd.read_csv('salaries (1).csv')
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2025,SE,FT,Data Specialist,176000,USD,176000,US,0,US,M
1,2025,SE,FT,Data Specialist,77500,USD,77500,US,0,US,M
2,2025,MI,FT,Applied Scientist,223400,USD,223400,US,0,US,L
3,2025,MI,FT,Applied Scientist,136000,USD,136000,US,0,US,L
4,2025,EN,FT,Data Analyst,96000,USD,96000,US,0,US,M


## Initial Data Analysis

In [3]:
# Displaying the number of rows and columns in the dataset.

print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

Dataset contains 139241 rows and 11 columns.


In [4]:
# Basic information about the dataset.

print("\nDataset Info:\n")
df.info()


Dataset Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139241 entries, 0 to 139240
Data columns (total 11 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   work_year           139241 non-null  int64 
 1   experience_level    139241 non-null  object
 2   employment_type     139241 non-null  object
 3   job_title           139241 non-null  object
 4   salary              139241 non-null  int64 
 5   salary_currency     139241 non-null  object
 6   salary_in_usd       139241 non-null  int64 
 7   employee_residence  139241 non-null  object
 8   remote_ratio        139241 non-null  int64 
 9   company_location    139241 non-null  object
 10  company_size        139241 non-null  object
dtypes: int64(4), object(7)
memory usage: 11.7+ MB


In [5]:
# Displaying the descriptive statistics for all numeric columns

df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
work_year,139241.0,2024.38582,0.678628,2020.0,2024.0,2024.0,2025.0,2025.0
salary,139241.0,162931.7035,213375.175187,14000.0,106000.0,147000.0,198800.0,30400000.0
salary_in_usd,139241.0,157458.393555,74158.66803,15000.0,105800.0,146000.0,197800.0,800000.0
remote_ratio,139241.0,20.895785,40.585185,0.0,0.0,0.0,0.0,100.0


In [6]:
# Displaying the total number of null values in the dataset

print("\nList of Null Values:")
df.isnull().sum()


List of Null Values:


Unnamed: 0,0
work_year,0
experience_level,0
employment_type,0
job_title,0
salary,0
salary_currency,0
salary_in_usd,0
employee_residence,0
remote_ratio,0
company_location,0


In [7]:
# Displaying the number of unique values in each column

df.nunique()

Unnamed: 0,0
work_year,6
experience_level,4
employment_type,4
job_title,405
salary,11579
salary_currency,26
salary_in_usd,12794
employee_residence,102
remote_ratio,3
company_location,95


## Preprocessing

In [8]:
# Mapping all the required categorical variables

df['experience_level'] = df['experience_level'].map({'EN': 'Entry', 'MI': 'Mid', 'SE': 'Senior', 'EX': 'Executive'})
df['employment_type'] = df['employment_type'].map({'FT': 'Full-time', 'PT': 'Part-time', 'CT': 'Contract', 'FL': 'Freelance'})
df['company_size'] = df['company_size'].map({'S': 'Small', 'M': 'Medium', 'L': 'Large'})
df['remote_category'] = df['remote_ratio'].map({0: 'On-site', 50: 'Hybrid', 100: 'Remote'})

In [9]:
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,remote_category
0,2025,Senior,Full-time,Data Specialist,176000,USD,176000,US,0,US,Medium,On-site
1,2025,Senior,Full-time,Data Specialist,77500,USD,77500,US,0,US,Medium,On-site
2,2025,Mid,Full-time,Applied Scientist,223400,USD,223400,US,0,US,Large,On-site
3,2025,Mid,Full-time,Applied Scientist,136000,USD,136000,US,0,US,Large,On-site
4,2025,Entry,Full-time,Data Analyst,96000,USD,96000,US,0,US,Medium,On-site


## Visualisation

In [10]:
# Top 10 jobs with their average salaries

top_10_jobs = df['job_title'].value_counts().head(10).index
job_count = df['job_title'].value_counts()[top_10_jobs]
average_salary = df.groupby('job_title')['salary_in_usd'].mean()[top_10_jobs]

In [11]:
gradient_colors = ["#0f4c75", "#226093", "#3282b8", "#3b84bd", "#4790c6", "#5fa2d2", "#6daeda", "#80bee6", "#9bcff0", "#bbe1fa"]
bar_colors = gradient_colors[:len(job_count)]

fig_1 = go.Figure(go.Bar(x=job_count.values, y=job_count.index,
    orientation='h',marker_color=bar_colors,
    text=[f"${s:,.0f}" for s in average_salary.loc[job_count.index]],
    hovertemplate='<b>%{y}</b><br>Count: %{x}<br>Avg Salary: %{text}<extra></extra>'))

fig_1.update_layout( title="Top 10 Job Titles with Average Salary",
    xaxis_title="Number of Jobs", yaxis_title="Job Title",
    yaxis=dict(autorange="reversed", tickfont=dict(color='black')),
    xaxis=dict(title_font=dict(color='black'), tickfont=dict(color='black')),
    title_font=dict(color='black'), font=dict(color='black'),height=600,template='plotly_white')

fig_1.show()

The bar chart "Top 10 Job Titles with Average Salary" shows the number of jobs and the average salary for the ten most frequent roles in the tech/data sector. Among these, Machine Learning Engineers earn the most, with an average salary of 198,712 USD, followed by Research Scientists and Product Managers at roughly 197,037 USD and 190,306 USD respectively. Interestingly, Data Scientists have a lower average salary (156,673 USD) than these counterparts but account for the most jobs. The chart illustrates how compensation and demand interact across tech/data roles, which helps with career planning by pointing readers to roles they may want to explore.
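
The count and the average salary discussed above can also be produced in a single `groupby` pass, which keeps the two numbers aligned by job title. This is a sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); in the notebook the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy records standing in for the real dataset.
toy = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Data Scientist",
                  "Machine Learning Engineer", "Machine Learning Engineer"],
    "salary_in_usd": [150_000, 160_000, 170_000, 190_000, 210_000],
})

# One pass gives both the number of records and the mean salary per title.
summary = (
    toy.groupby("job_title")["salary_in_usd"]
    .agg(count="count", avg_salary="mean")
    .sort_values("count", ascending=False)
)
print(summary)
```

Taking `summary.head(10)` would then give the ten most frequent titles together with their averages.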


In [14]:
# Remote Work Ratio Distribution across various Experience Levels

experience_levels = df['experience_level'].unique()
total_levels = len(experience_levels)
# One fixed colour per category, so the same label has the same colour in every pie
remote_colors = {'On-site': "#226093", 'Remote': "#bbe1fa", 'Hybrid': "#3282b8"}

# Initialising the subplots for Pie Charts
fig_2 = make_subplots(rows=1, cols=total_levels,
    specs=[[{'type':'domain'}]*total_levels],
    subplot_titles=[f"Experience: {level}" for level in experience_levels] )

# Looping over the experience levels to add one Pie Chart each
for i, level in enumerate(experience_levels):
    data = df[df['experience_level'] == level]['remote_category'].value_counts()
    colors_for_pie = [remote_colors[label] for label in data.index]
    fig_2.add_trace( go.Pie(labels=data.index, values=data.values, pull=[0.05]*len(data), name=level,marker=dict(colors=colors_for_pie)),row=1, col=i+1)

# Pie charts have no x/y axes, so only the title, font and size are set
fig_2.update_layout(title='Remote Work Ratio Distribution by Experience Level',
    title_font=dict(color='black'), font=dict(color='black'),width=300 * total_levels,height=500)

fig_2.show()

The pie charts "Remote Work Ratio Distribution by Experience Level" show the split of work arrangements (On-site, Remote, and Hybrid) across the four experience levels: Entry, Mid, Senior, and Executive. On-site work dominates at every level, accounting for more than 75% of jobs. The on-site share is highest at entry level, at 84%, while senior roles are relatively flexible, with almost 23.5% remote workers. Hybrid arrangements are rare at all levels. In short, while remote work exists, especially at higher levels of experience, on-site jobs remain the prevailing norm in today's job market.
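
The percentages read off the pies can be tabulated directly with a row-normalised cross-tabulation. The sketch below uses a hypothetical eight-row frame (`toy`); the column names match the notebook, the values do not come from the dataset.

```python
import pandas as pd

# Hypothetical records: four Entry and four Senior rows.
toy = pd.DataFrame({
    "experience_level": ["Entry"] * 4 + ["Senior"] * 4,
    "remote_category": ["On-site", "On-site", "On-site", "Remote",
                        "On-site", "On-site", "Remote", "Hybrid"],
})

# normalize="index" turns each row into shares of that experience level.
shares = pd.crosstab(
    toy["experience_level"], toy["remote_category"], normalize="index"
) * 100
print(shares.round(1))
```

Each row sums to 100, so the table carries the same information as one pie per experience level.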

In [15]:
df = df[df['remote_category'].notna()]

# Sorting the Unique Experience Levels
unique_exp_levels = sorted(df['experience_level'].dropna().unique())

# Creating Subplots for Each Experience Level
fig_3 = make_subplots(rows=1,cols=len(unique_exp_levels),subplot_titles=[f'Experience: {level}' for level in unique_exp_levels] )

for col_index, level in enumerate(unique_exp_levels, start=1):
    # Filtering the Data for each Level
    each_level = df[df['experience_level'] == level]

    # Calculating the total jobs per year (Hybrid jobs count towards the total)
    total = each_level.groupby('work_year').size()

    # Counting the remote jobs per year
    remote = each_level[each_level['remote_category'] == 'Remote'].groupby('work_year').size()

    # Counting the onsite jobs per year
    onsite = each_level[each_level['remote_category'] == 'On-site'].groupby('work_year').size()

    data = pd.DataFrame({'total': total, 'remote': remote, 'onsite': onsite}).fillna(0)
    data['remote_pct'] = (data['remote'] / data['total']) * 100
    data['onsite_pct'] = (data['onsite'] / data['total']) * 100

    # Plotting the Remote Job Line
    fig_3.add_trace(
        go.Scatter(x=data.index, y=data['remote_pct'],mode='lines+markers', name='Remote',
            line=dict(color='#80bee6'), showlegend=(col_index == 1)),row=1, col=col_index)

    # Plotting the Onsite Job Line
    fig_3.add_trace(
        go.Scatter(x=data.index, y=data['onsite_pct'],mode='lines+markers', name='On-site',
            line=dict(color='#0f4c75'), showlegend=(col_index == 1)), row=1, col=col_index)

# The y-axis keeps its default orientation so that higher percentages plot higher in every panel
fig_3.update_layout(title='Remote vs On-site Job Trends by Experience Level',
    title_font=dict(color='black'), font=dict(color='black'),  width=320 * len(unique_exp_levels),height=500,
    template='plotly_white',hovermode='x unified')

fig_3.show()

The line charts "Remote vs On-site Job Trends by Experience Level" show how the shares of remote and on-site jobs changed from 2020 to 2025 at each experience level. The on-site share has risen considerably at all levels (Entry, Mid, Senior, and Executive), especially since 2022. The remote share generally peaked around 2021-2022 and has declined gradually since then, most visibly for Executive and Mid-level positions. This pattern suggests a solid post-pandemic shift back to on-site work, possibly driven by organisational policy changes and by employers prioritising in-person collaboration, management, and supervision.
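
The yearly remote share behind these lines can also be computed without separate counts, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on hypothetical data (`toy` is made up; only the column names come from the notebook):

```python
import pandas as pd

# Hypothetical records for a single experience level across two years.
toy = pd.DataFrame({
    "work_year": [2021, 2021, 2021, 2021, 2024, 2024, 2024, 2024],
    "experience_level": ["Mid"] * 8,
    "remote_category": ["Remote", "Remote", "On-site", "Hybrid",
                        "On-site", "On-site", "On-site", "Remote"],
})

# Mean of the "is remote" flag per (level, year) is the remote share.
remote_pct = (
    toy["remote_category"].eq("Remote")
    .groupby([toy["experience_level"], toy["work_year"]])
    .mean()
    .mul(100)
    .unstack("experience_level")
)
print(remote_pct)
```

As in the plotting cell, Hybrid rows stay in the denominator, so the remote and on-site shares do not have to add up to 100.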

In [16]:
# Total count of Entry levels hired by each company size

freshers_data = df[df['experience_level'] == 'Entry']

freshers_data_company_size = freshers_data['company_size'].value_counts().sort_index().reset_index()
freshers_data_company_size.columns = ['company_size', 'count']

fig_4 = px.bar(freshers_data_company_size,
             x='company_size',y='count',
             title='Number of Freshers Hired by Company Size',
             labels={'company_size': 'Company Size', 'count': 'Number of Freshers Hired'},text='count' ,
             color='company_size',color_discrete_sequence=['#80bee6', '#3282b8', '#0f4c75'])

fig_4.update_traces(textposition='outside')
fig_4.update_layout(yaxis=dict(showticklabels=False, tickfont=dict(color='black')),template='plotly_white',
    xaxis=dict(title_font=dict(color='black'), tickfont=dict(color='black')),title_font=dict(color='black'), font=dict(color='black') ,height=600)

fig_4.show()

In the bar chart "Number of Freshers Hired by Company Size", medium-sized companies clearly lead with 12,596 entry-level hires, far more than large (348) and small (61) companies combined. This suggests that medium-sized companies are the most aggressive recruiters of fresh graduates, possibly because they are in a growth phase and are looking for scalable, inexpensive talent.
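
To put the three counts on a common scale, they can be converted to shares of all entry-level records. The counts below are the ones quoted in the paragraph above; the variable names are only illustrative.

```python
# Entry-level counts by company size, as quoted in the paragraph above.
entry_counts = {"Medium": 12_596, "Large": 348, "Small": 61}

total = sum(entry_counts.values())
shares = {size: 100 * n / total for size, n in entry_counts.items()}

for size, pct in shares.items():
    print(f"{size}: {pct:.1f}% of {total:,} entry-level records")
```

These shares describe only the records in this dataset, so they are best read alongside the overall mix of company sizes in `df`.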

In [17]:
# The Average Salary of each country

country_names = {
    'US': 'United States', 'GB': 'United Kingdom', 'DE': 'Germany', 'FR': 'France',
    'CA': 'Canada', 'IN': 'India', 'AU': 'Australia', 'ES': 'Spain', 'BR': 'Brazil',
    'NL': 'Netherlands', 'JP': 'Japan', 'CH': 'Switzerland', 'IT': 'Italy',
    'SG': 'Singapore', 'SE': 'Sweden', 'MX': 'Mexico', 'FI': 'Finland', 'DK': 'Denmark',
    'PL': 'Poland', 'PT': 'Portugal', 'NZ': 'New Zealand', 'IE': 'Ireland',
    'HK': 'Hong Kong', 'RU': 'Russia', 'BE': 'Belgium', 'IL': 'Israel',
    'UA': 'Ukraine', 'TR': 'Turkey', 'AE': 'United Arab Emirates', 'ZA': 'South Africa',
    'CO': 'Colombia', 'AR': 'Argentina', 'CL': 'Chile', 'AT': 'Austria', 'MY': 'Malaysia',
    'NG': 'Nigeria', 'VN': 'Vietnam', 'KR': 'South Korea', 'TH': 'Thailand'
}

avg_salary_by_residence = df.groupby('employee_residence')['salary_in_usd'].mean().reset_index()

avg_salary_by_residence['country_name'] = avg_salary_by_residence['employee_residence'].map(country_names)
avg_salary_by_residence = avg_salary_by_residence.dropna(subset=['country_name'])

fig_5 = px.choropleth(avg_salary_by_residence,locations='country_name',locationmode='country names',color='salary_in_usd',
                     hover_name='country_name', hover_data={'employee_residence': True, 'salary_in_usd': ':,.0f'},
                     color_continuous_scale=px.colors.sequential.Plasma,title='Average Salary by Employee Residence',
                     labels={'salary_in_usd': 'Average Salary(USD)'},projection='natural earth')

fig_5.update_layout(template='plotly_white',  width=1000,  height=600,
    title_font=dict(color='black'),
    coloraxis_colorbar=dict(title_font=dict(color='black'), tickfont=dict(color='black')))

fig_5.show()

The choropleth map "Average Salary by Employee Residence" shows how average salaries (in USD) for AI-related positions vary with the employee's country of residence. The highest average salaries are found in North America, especially the USA, where they exceed 160,000 USD, followed by Australia and parts of Western Europe (UK, Germany, Switzerland). Countries in Africa, South Asia, and parts of South America have the lowest average salaries, often below 60,000 USD. These gaps reflect not only differences in economic position and cost of living, but also differences in how developed the AI job market is in each region.
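
The `country_names` dictionary lists fewer countries than the 102 residence codes reported by `df.nunique()`, and the `dropna` step removes every code without a name from the map. A short sketch for listing what is dropped, using a shortened hypothetical mapping and made-up residence codes:

```python
import pandas as pd

# Shortened, hypothetical version of the `country_names` mapping.
country_names = {"US": "United States", "DE": "Germany", "IN": "India"}

# Hypothetical residence codes; "EG" and "KE" are missing from the mapping.
residence = pd.Series(
    ["US", "US", "DE", "EG", "IN", "KE", "EG"], name="employee_residence"
)

# Codes that would become NaN after .map(country_names) and then be dropped.
unmapped = residence[~residence.isin(list(country_names))].value_counts()
print(unmapped)
```

Running the same check on `df['employee_residence']` with the full dictionary would show which countries are absent from the map and how many records each one holds.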

In [18]:
# Top Job title for each year

job_title_count_year= df.groupby(['work_year', 'job_title']).size().reset_index(name='count')

top_job_title_count_year = job_title_count_year.loc[job_title_count_year.groupby('work_year')['count'].idxmax()]

fig_6 = px.bar(top_job_title_count_year, x='work_year',y='count',
    text='job_title',title='Job Title with Highest Count per Year',
    labels={'work_year': 'Year', 'count': 'Number of Hires'}, color='job_title',
    color_discrete_sequence=['#0f4c75', '#80bee6', '#3282b8'])

fig_6.update_layout(yaxis=dict(tickfont=dict(color='black')),  xaxis=dict(title_font=dict(color='black'), tickfont=dict(color='black')),
                    title_font=dict(color='black'),template='plotly_white',showlegend=False)

fig_6.update_traces(textposition='outside')
fig_6.show()

The bar chart "Job Title with Highest Count per Year" shows the most frequently hired job title for each year from 2020 through 2025. Data Scientist leads in most years, with a spike in 2023 and 2024, when it reached nearly 9,000 hires. It is worth noting that Software Engineer takes first place in 2025 and Data Engineer in 2022. Overall, data roles are growing and generally in demand, and the leading title changes little from year to year.
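
Because the yearly totals differ widely, the raw count of the leading title is easier to compare across years when it is shown next to its share of that year's records. The sketch below repeats the `idxmax` selection from the cell above on a hypothetical toy frame and adds a `share` column, which is an addition for illustration and not part of the original analysis.

```python
import pandas as pd

# Hypothetical records: three per year.
toy = pd.DataFrame({
    "work_year": [2022, 2022, 2022, 2023, 2023, 2023],
    "job_title": ["Data Engineer", "Data Engineer", "Data Scientist",
                  "Data Scientist", "Data Scientist", "Data Engineer"],
})

counts = toy.groupby(["work_year", "job_title"]).size().reset_index(name="count")

# Share of each title within its year, so years of different size compare fairly.
counts["share"] = counts["count"] / counts.groupby("work_year")["count"].transform("sum")

# idxmax returns the row label of the largest count in each year
# (the first such row if two titles tie).
top_per_year = counts.loc[counts.groupby("work_year")["count"].idxmax()]
print(top_per_year)
```
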

In [19]:
# Top Job Title according to each experience level

job_title_count_experience = df.groupby(['experience_level', 'job_title']).size().reset_index(name='count')

top_job_title_count_experience = job_title_count_experience.loc[job_title_count_experience.groupby('experience_level')['count'].idxmax()]

fig_7 = px.bar( top_job_title_count_experience, x='experience_level',y='count',
    text='job_title',title='Job Title with Highest Count per Experience Level',
    labels={'experience_level': 'Experience Level', 'count': 'Number of Hires'},color='job_title',
    color_discrete_sequence=['#0f4c75', '#80bee6', '#3282b8'])

fig_7.update_layout(yaxis=dict(tickfont=dict(color='black')),  xaxis=dict(title_font=dict(color='black'), tickfont=dict(color='black')),
                    title_font=dict(color='black'),template='plotly_white',showlegend=False)

fig_7.update_traces(textposition='outside')
fig_7.show()

This bar chart shows the most commonly hired job title at each experience level. At entry level, the most common title is Data Analyst. At both mid and senior level, it is Data Scientist. At executive level, the most frequent title is Data Engineer, although that does not necessarily make it the executive role in highest demand, since Data Engineer is a specialised niche within executive leadership. The way the leading title shifts with experience offers a glimpse into career paths in the AI discipline.

In [20]:
# Top 10 Job title for Entry level

entry_data = df[df['experience_level'] == 'Entry']

top_job_title_count = entry_data['job_title'].value_counts().nlargest(10).reset_index()
top_job_title_count.columns = ['job_title', 'count']

fig_8 = px.bar(top_job_title_count,x='count',y='job_title',orientation='h',
    color='count',title='Top 10 Entry-Level Job Titles', text='count',
    labels={'count': 'Number of Jobs', 'job_title': 'Job Title'},
    color_continuous_scale=["#bbe1fa", "#3282b8", "#226093", "#3b84bd", "#4790c6", "#5fa2d2", "#6daeda", "#80bee6", "#9bcff0", "#0f4c75"])

fig_8.update_traces(textposition='outside')
fig_8.update_coloraxes(showscale=False)
fig_8.update_layout(yaxis=dict(autorange="reversed", tickfont=dict(color='black')),xaxis=dict(showticklabels=False,title_font=dict(color='black'),
                    tickfont=dict(color='black')),title_font=dict(color='black'), font=dict(color='black'),template='plotly_white',height=600,showlegend=False)

fig_8.show()

This horizontal bar chart summarizes the ten most common entry-level job titles in the AI and data industry. The most common entry point is Data Analyst, with over 5,000 hires, followed by general Analyst positions and Data Scientist roles. Several technical positions, such as Data Engineer, Software Engineer, and Machine Learning Engineer, also appear, which shows that employers hire entry-level candidates with coding and modeling skills. Titles such as Research Analyst and Business Intelligence Analyst point to opportunities for people with broader analytical or domain-specific backgrounds. Overall, the chart gives an overview of where entry-level candidates are most likely to enter the AI job market.



---
# **Conclusion**

To summarize, the AI job market between 2020 and 2025 shows substantial growth in positions such as Data Analyst, Data Scientist, and Software Engineer, particularly for entry-level candidates. Medium-sized companies hired more freshers than any other company category, while the highest salaries were found in the US, the UK, Germany, and a few other countries. As AI continues to evolve, this analysis suggests that career paths will keep shifting with experience and that the skills in demand will continue to reshape hiring patterns around the world.

---