# Data Science Salaries

## Introduction
In recent years, the field of data science has experienced exponential growth, with organizations across various industries recognizing the value of data-driven insights. As demand for data scientists continues to rise, understanding the factors that influence job salaries becomes crucial for both employers and job seekers. This project aims to analyze data science job salaries in the United States from 2020 to 2023, exploring trends and patterns in compensation and developing a machine learning model to predict salaries based on multiple factors.

### Stages
1. Data Preprocessing
2. Exploratory Data Analysis
3. Statistical Analysis
4. Feature Preparation
5. Model Development
6. Model Evaluation
7. Conclusion

## Data Preprocessing

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# load data
df = pd.read_csv('v7_Latest_Data_Science_Salaries.csv')

In [3]:
# general info on data
df.info()
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5736 entries, 0 to 5735
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Job Title           5736 non-null   object
 1   Employment Type     5736 non-null   object
 2   Experience Level    5736 non-null   object
 3   Expertise Level     5736 non-null   object
 4   Salary              5736 non-null   int64 
 5   Salary Currency     5736 non-null   object
 6   Company Location    5736 non-null   object
 7   Salary in USD       5736 non-null   int64 
 8   Employee Residence  5736 non-null   object
 9   Company Size        5736 non-null   object
 10  Year                5736 non-null   int64 
dtypes: int64(3), object(8)
memory usage: 493.1+ KB


Unnamed: 0,Job Title,Employment Type,Experience Level,Expertise Level,Salary,Salary Currency,Company Location,Salary in USD,Employee Residence,Company Size,Year
257,Research Engineer,Full-Time,Entry,Junior,32982,British Pound Sterling,South Africa,40581,South Africa,Medium,2023
38,Data Scientist,Full-Time,Mid,Intermediate,97000,United States Dollar,United States,97000,United States,Medium,2024
3686,Machine Learning Engineer,Full-Time,Senior,Expert,134500,United States Dollar,United States,134500,United States,Large,2023
3442,Data Architect,Full-Time,Mid,Intermediate,135000,United States Dollar,United States,135000,United States,Medium,2023
2509,Research Engineer,Full-Time,Mid,Intermediate,70000,British Pound Sterling,United Kingdom,86128,United Kingdom,Medium,2023
4274,Data Analyst,Full-Time,Entry,Junior,50000,United States Dollar,United States,50000,Kuwait,Large,2023
4643,AI Scientist,Full-Time,Entry,Junior,200000,United States Dollar,Canada,200000,Canada,Large,2022
5602,Data Architect,Full-Time,Mid,Intermediate,180000,United States Dollar,United States,180000,United States,Large,2021
5077,Machine Learning Developer,Full-Time,Entry,Junior,15000,United States Dollar,Thailand,15000,Thailand,Large,2021
5006,Machine Learning Engineer,Full-Time,Senior,Expert,110000,United States Dollar,Canada,110000,Canada,Medium,2022


In [4]:
# fix columns naming conventions
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ','_')

In [5]:
# change all string values to lowercase
columns_str = ['job_title','employment_type','experience_level','expertise_level','salary_currency','company_location','employee_residence','company_size']
for element in columns_str:
    df[element] = df[element].str.lower()

In [6]:
# check changes
df.info()
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5736 entries, 0 to 5735
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   job_title           5736 non-null   object
 1   employment_type     5736 non-null   object
 2   experience_level    5736 non-null   object
 3   expertise_level     5736 non-null   object
 4   salary              5736 non-null   int64 
 5   salary_currency     5736 non-null   object
 6   company_location    5736 non-null   object
 7   salary_in_usd       5736 non-null   int64 
 8   employee_residence  5736 non-null   object
 9   company_size        5736 non-null   object
 10  year                5736 non-null   int64 
dtypes: int64(3), object(8)
memory usage: 493.1+ KB


Unnamed: 0,job_title,employment_type,experience_level,expertise_level,salary,salary_currency,company_location,salary_in_usd,employee_residence,company_size,year
2076,research scientist,full-time,mid,intermediate,270000,united states dollar,united states,270000,united states,large,2023
3499,data analyst,full-time,senior,expert,81800,united states dollar,united kingdom,81800,united kingdom,medium,2023
5262,data analyst,full-time,entry,junior,15000,united states dollar,indonesia,15000,indonesia,large,2022
2479,machine learning engineer,full-time,mid,intermediate,170000,united states dollar,united states,170000,united states,small,2023
5426,director of data science,full-time,executive,director,250000,canadian dollar,canada,192037,canada,large,2022
367,data modeler,full-time,senior,expert,110000,united states dollar,united states,110000,united states,medium,2023
800,business intelligence analyst,full-time,entry,junior,35000,british pound sterling,united kingdom,43064,united kingdom,medium,2023
1055,machine learning engineer,full-time,senior,expert,255000,united states dollar,united states,255000,united states,medium,2023
3049,bi analyst,full-time,senior,expert,87200,united states dollar,united states,87200,united states,medium,2023
124,data science engineer,full-time,mid,intermediate,135000,united states dollar,united states,135000,united states,medium,2024


In [7]:
# check for duplicates in df
df.duplicated().sum()

0

In [8]:
# check for implicit duplicates in job title column
df.job_title.unique()

array(['data engineer', 'data analyst', 'business intelligence developer',
       'bi developer', 'business intelligence analyst', 'data developer',
       'ai architect', 'data architect', 'data scientist',
       'machine learning engineer', 'data science', 'research engineer',
       'data science manager', 'data analytics manager',
       'research analyst', 'ai engineer', 'research scientist',
       'data science engineer', 'data product manager',
       'analytics engineer', 'data specialist', 'data modeler',
       'etl developer', 'data strategist', 'prompt engineer',
       'data science lead', 'ml engineer', 'data quality manager',
       'applied scientist', 'head of data',
       'business intelligence engineer', 'data science consultant',
       'machine learning scientist', 'business intelligence manager',
       'data manager', 'computer vision engineer', 'ai product manager',
       'data analytics lead', 'director of data science',
       'data product owner', 'machin

In [9]:
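# fix implicit duplicates: expand the standalone abbreviations 'ml' and 'bi'
# (word boundaries keep longer words that merely contain these letters intact)
df.job_title = df.job_title.str.replace(r'\bml\b','machine learning',regex=True)
df.job_title = df.job_title.str.replace(r'\bbi\b','business intelligence',regex=True)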
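
As a sanity check on the abbreviation expansion, the sketch below compares plain substring replacement with a word-boundary regex. The job titles are hypothetical and chosen only to illustrate the difference; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical job titles, chosen only to illustrate the two behaviours.
titles = pd.Series(["ml engineer", "bi developer", "big data engineer", "mlops engineer"])

# Plain substring replacement also rewrites letters inside longer words.
naive = titles.str.replace("bi", "business intelligence", regex=False)

# A word-boundary regex only matches the standalone abbreviation.
bounded = titles.str.replace(r"\bbi\b", "business intelligence", regex=True)
bounded = bounded.str.replace(r"\bml\b", "machine learning", regex=True)

print(naive.tolist())
print(bounded.tolist())
```

With the substring version, a title such as "big data engineer" would be altered, while the word-boundary version leaves it untouched.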

In [10]:
# remove salaries for companies not located in US
df = df[df.company_location=='united states']

In [11]:
# double check unique values in columns
print(df.company_location.unique())
print(df.salary_currency.unique())
print(df.employee_residence.unique())

['united states']
['united states dollar' 'british pound sterling' 'euro' 'indian rupee'
 'hungarian forint']
['united states' 'uganda' 'italy' 'thailand' 'canada' 'philippines'
 'germany' 'tunisia' 'belgium' 'turkey' 'nigeria' 'ghana' 'india' 'egypt'
 'uzbekistan' 'argentina' 'france' 'portugal' 'kuwait' 'spain' 'china'
 'costa rica' 'chile' 'bolivia, plurinational state of' 'malaysia'
 'brazil' 'russian federation' 'viet nam' 'greece' 'bulgaria' 'hungary'
 'puerto rico' 'romania']


In [12]:
# remove unnecessary columns
df = df.drop(['company_location','salary','salary_currency','employee_residence'],axis=1)

In [13]:
# check changes
df.info()
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
Index: 4564 entries, 0 to 5734
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   job_title         4564 non-null   object
 1   employment_type   4564 non-null   object
 2   experience_level  4564 non-null   object
 3   expertise_level   4564 non-null   object
 4   salary_in_usd     4564 non-null   int64 
 5   company_size      4564 non-null   object
 6   year              4564 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 285.2+ KB


Unnamed: 0,job_title,employment_type,experience_level,expertise_level,salary_in_usd,company_size,year
206,data science manager,full-time,mid,intermediate,190000,medium,2023
3412,data engineer,full-time,senior,expert,142200,medium,2023
729,research engineer,full-time,senior,expert,177700,medium,2023
4859,data engineer,full-time,senior,expert,247500,medium,2022
3132,data scientist,full-time,senior,expert,184250,medium,2023
3683,data analyst,full-time,senior,expert,147000,medium,2023
1105,research engineer,full-time,senior,expert,115607,medium,2023
1850,data scientist,full-time,senior,expert,155000,medium,2023
449,business intelligence analyst,full-time,senior,expert,176875,medium,2023
1588,machine learning engineer,full-time,senior,expert,140100,medium,2023


## Exploratory Data Analysis

In [14]:
import matplotlib.pyplot as plt
import plotly.express as px

In [15]:
# distribution of salaries in us
hist1 = px.histogram(
    df,x='salary_in_usd',
    title='Distribution of Data Science Jobs Salaries',
    labels={'salary_in_usd':'Salary'}
    )
hist1.update_layout(title_font_size=16,bargap=0.1,height=500)
hist1.show()

The distribution of data science job salaries appears roughly normal, with a mean around $135K, but a small number of very large salaries stretch the right tail and may skew summary statistics. These salaries likely belong to a few employees in very senior positions, or to freelance or contract workers who can charge high rates for their services. I will examine these outliers further and determine whether they should be included in the data used to train the machine learning model.
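
One way to back up the visual impression is to compare the mean with the median and look at the sample skewness. The following is a minimal sketch, not part of the original analysis: `describe_shape` is a hypothetical helper, and the data is a synthetic right-skewed stand-in rather than the real salaries.

```python
import numpy as np
import pandas as pd

def describe_shape(salaries: pd.Series) -> dict:
    """Summarize how far a salary distribution departs from symmetry."""
    return {
        "mean": salaries.mean(),
        "median": salaries.median(),
        "skewness": salaries.skew(),  # roughly 0 for a symmetric distribution
    }

# Synthetic right-skewed stand-in data, not the real dataset.
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.lognormal(mean=11.8, sigma=0.4, size=5000))

shape = describe_shape(synthetic)
print(shape)
```

In the notebook this could be called as `describe_shape(df['salary_in_usd'])`. A mean noticeably above the median, together with positive skewness, would indicate that the high salaries are pulling the distribution to the right.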

In [16]:
# view the top 20 job titles in the distribution
top_jobs = df.job_title.value_counts()[:20]

In [17]:
# bar graph of top jobs
fig = px.bar(
    top_jobs,
    x=top_jobs,
    y=top_jobs.index,
    color=top_jobs.index,
    labels={'job_title':'Job Title','x':'Number of Jobs'},
    title='Top 20 Data Science Jobs',
    text_auto=True
)
fig.update_layout(title_font_size=16,showlegend=False,height=600)


Data Engineer is the most common job title in the dataset, which suggests significant demand for professionals with data engineering and infrastructure skills. Data Scientist is the second most common job title and Data Analyst the third, which also points to demand for higher-level data professionals. Machine Learning Engineer comes next, suggesting high demand for data professionals who can build ML models. Several research and development roles also appear in the top 20, such as research engineer, applied scientist and research analyst, so a significant number of data science roles are research-oriented. The other large group in the top 20 consists of specialist roles, such as data architect, data specialist and analytics engineer, which reflects that quite a few roles require specialized skills.

In [18]:
# distribution of employment type
fig = px.histogram(
    df,x='employment_type',
    title='Distribution of Types of Employment',
    labels={'employment_type':'Employment Type'},
    category_orders=dict(employment_type=['freelance','contract','part-time','full-time']),
    color='employment_type',
    text_auto=True
)
fig.update_layout(title_font_size=16,showlegend=False)

Nearly all of the jobs listed in this dataset are for full-time employees. This reflects that most data science jobs are full-time positions with only about 1% of the positions being freelance, contract or part-time positions. 
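
The percentage claims in this section can be read directly from the data instead of from the charts. A minimal sketch follows, assuming a hypothetical helper `category_shares` and a toy column that stands in for the real one.

```python
import pandas as pd

def category_shares(column: pd.Series) -> pd.Series:
    """Return the percentage of rows in each category, largest first."""
    return (column.value_counts(normalize=True) * 100).round(1)

# Hypothetical toy column, not the real dataset.
toy = pd.Series(["full-time"] * 97 + ["contract"] * 2 + ["part-time"])

shares = category_shares(toy)
print(shares)
```

In the notebook, `category_shares(df['employment_type'])` would give the exact share behind the "about 1%" estimate, and the same call works for `experience_level` and `company_size`.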

In [19]:
# distribution of experience level
fig = px.histogram(
    df,x='experience_level',
    color='experience_level',
    title='Distribution of Experience Level',
    labels={'experience_level':'Experience Level'},
    category_orders=dict(experience_level=['entry','mid','senior','executive']),
    text_auto=True
    )
fig.update_layout(title_font_size=16,showlegend=False)

More than 25% of the dataset consists of salaries of senior-level employees. This suggests that data science positions skew toward higher experience levels, with far fewer entry-level positions available.

In [20]:
# distribution of level of expertise
fig = px.histogram(
    df,x='expertise_level',
    color='expertise_level',
    title='Distribution of Expertise',
    labels={'expertise_level':'Expertise'},
    category_orders=dict(expertise_level=['junior','intermediate','expert','director']),
    text_auto=True
    )
fig.update_layout(title_font_size=16,showlegend=False)

The distribution of expertise level is exactly the same as that of experience level: the two columns carry the same information under different labels (entry/junior, mid/intermediate, senior/expert, executive/director). I will therefore include only one of them as a feature for the machine learning model.
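
Matching histograms do not by themselves prove that the two columns agree row by row, so a direct check is useful before dropping one of them. The sketch below uses a hypothetical helper, `is_one_to_one`, on toy rows built from the level names that appear in the samples above.

```python
import pandas as pd

def is_one_to_one(frame: pd.DataFrame, left: str, right: str) -> bool:
    """True when each value of `left` pairs with exactly one value of `right`, and vice versa."""
    pairs = frame[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Toy rows, not the real dataset.
toy = pd.DataFrame({
    "experience_level": ["entry", "mid", "senior", "executive", "senior"],
    "expertise_level": ["junior", "intermediate", "expert", "director", "expert"],
})

# A crosstab with a single non-zero cell per row and column shows the same thing visually.
print(pd.crosstab(toy["experience_level"], toy["expertise_level"]))
print(is_one_to_one(toy, "experience_level", "expertise_level"))
```

If the check returns `True` on the real `df`, the two columns are redundant and either one can serve as the feature.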

In [21]:
# distribution of company size
fig = px.histogram(
    df,x='company_size',
    color='company_size',
    title='Distribution of Company Size',
    labels={'company_size':'Company Size'},
    category_orders=dict(company_size=['small','medium','large']),
    text_auto=True
    )
fig.update_layout(title_font_size=16,showlegend=False)

Nearly 90% of the data science jobs in the dataset are for medium sized companies. 

In [22]:
# distribution of years
fig = px.histogram(
    df,x='year',
    color='year',
    title='Distribution of Jobs Across the Years',
    labels={'year':'Year'},
    text_auto=True
    )
fig.update_layout(title_font_size=16,showlegend=False)

The vast majority of jobs in the dataset are from 2023. Since we are still at the beginning of 2024, we are likely seeing only a small portion of the data science jobs for the current year. The data suggests that the number of data science positions is growing at a significant rate, but we would need further information from previous years to draw that conclusion accurately.

In [23]:
# distribution of salaries based on experience level
fig = px.histogram(
    df,
    x='salary_in_usd',
    color='experience_level',
    barmode='overlay',
    labels={'experience_level':'Experience Level','salary_in_usd':'Salary (USD)'},
    title='Distribution of Salaries Based on Experience Level')
fig.update_layout(title_font_size=16)

In [24]:
# box plot of salaries based on experience level
fig = px.box(
    df,
    x='salary_in_usd',
    y='experience_level',
    color='experience_level',
    title='Distribution of Salaries Based on Experience Level',
    labels={'experience_level':'Experience Level','salary_in_usd':'Salary (USD)'}
)
fig.update_layout(title_font_size=16,boxgap=0.1,showlegend=False)
fig.update_traces(boxmean=True)

We can see that the distribution of salaries is impacted by the employee's experience level, with the mean and median being smaller values for lower experience levels. We can also see that all of the distributions do have some outliers, but the senior and mid experience levels have some very high value outliers that will need to be addressed prior to training the predictive model. 
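
One possible way to address these outliers is the 1.5 × IQR rule commonly used for box-plot whiskers, applied within each experience level. This is a sketch under that assumption and not necessarily the method used later in the project. `flag_iqr_outliers` is a hypothetical helper and the salaries are toy values.

```python
import pandas as pd

def flag_iqr_outliers(frame: pd.DataFrame, value: str, group: str, k: float = 1.5) -> pd.Series:
    """Flag rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] within their group."""
    grouped = frame.groupby(group)[value]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (frame[value] < q1 - k * iqr) | (frame[value] > q3 + k * iqr)

# Hypothetical toy salaries, not the real dataset.
toy = pd.DataFrame({
    "experience_level": ["mid"] * 5 + ["senior"] * 5,
    "salary_in_usd": [90_000, 95_000, 100_000, 105_000, 400_000,
                      140_000, 150_000, 160_000, 170_000, 180_000],
})
toy["is_outlier"] = flag_iqr_outliers(toy, "salary_in_usd", "experience_level")
print(toy[toy["is_outlier"]])
```

Flagging rather than dropping keeps the decision reversible: the model can be trained with and without the flagged rows to see how much they matter.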

In [25]:
# distribution of salaries based on company size
fig = px.histogram(
    df,
    x='salary_in_usd',
    color='company_size',
    barmode='overlay',
    labels={'company_size':'Company Size','salary_in_usd':'Salary (USD)'},
    title='Distribution of Salaries Based on Company Size')
fig.update_layout(title_font_size=16)

In [26]:
# box plot of salaries based on company size
fig = px.box(
    df,
    x='salary_in_usd',
    y='company_size',
    color='company_size',
    title='Distribution of Salaries Based on Company Size',
    labels={'company_size':'Company Size','salary_in_usd':'Salary (USD)'}
)
fig.update_layout(title_font_size=16,boxgap=0.1,showlegend=False)
fig.update_traces(boxmean=True)

Here we can see that company size also affects the distribution of salaries, but mainly for small companies, whose median and mean are lower than those of medium and large companies. The distributions for medium and large companies look quite similar, so I will perform statistical testing to see whether their means are significantly different.

In [27]:
# distribution of salaries based on year
fig = px.histogram(
    df,
    x='salary_in_usd',
    color='year',
    barmode='overlay',
    labels={'year':'Year','salary_in_usd':'Salary (USD)'},
    title='Distribution of Salaries Based on Year')
fig.update_layout(title_font_size=16)

In [28]:
# box plot of salaries based on year
fig = px.box(
    df,
    x='salary_in_usd',
    color='year',
    title='Distribution of Salaries Based on Year',
    labels={'year':'Year','salary_in_usd':'Salary (USD)'}
)
fig.update_layout(title_font_size=16,boxgap=0.1)
fig.update_traces(boxmean=True)

The distributions of salaries by year are all fairly similar, but there does seem to be a trend of median salaries increasing each year, except for 2024. As we do not have a complete picture of the data for 2024, we cannot confirm this trend.
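
The year-over-year trend can be quantified by tabulating the median salary per year alongside the number of rows behind it. The sketch below assumes the notebook's column names (`year`, `salary_in_usd`). `median_by_year` is a hypothetical helper and the values are toy data.

```python
import pandas as pd

def median_by_year(frame: pd.DataFrame) -> pd.DataFrame:
    """Median salary and row count per year, plus the change in median from the previous year."""
    summary = frame.groupby("year")["salary_in_usd"].agg(median="median", count="count")
    summary["median_change"] = summary["median"].diff()
    return summary

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "year": [2021, 2021, 2022, 2022, 2023, 2023, 2023],
    "salary_in_usd": [100_000, 120_000, 120_000, 140_000, 130_000, 150_000, 170_000],
})

trend = median_by_year(toy)
print(trend)
```

The `count` column matters as much as the median: a year with few rows, as 2024 likely is here, gives a much less reliable median.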

## Statistical Analysis

In [29]:
import scipy.stats as st

I will test the following hypotheses at a significance level (alpha) of 5%:
- Null Hypothesis: The average salaries for mid and senior experience employees are equal.
- Alternative Hypothesis: The average salary for mid experience employees is less than that for senior experience employees.

In [30]:
# test the hypothesis
alpha = 0.05
results = st.ttest_ind(
    df[df.experience_level=='mid']['salary_in_usd'],
    df[df.experience_level=='senior']['salary_in_usd'],
    alternative='less'
)
print(f'p-value: {results.pvalue}')

p-value: 3.730777167778775e-48


With a p-value far below our alpha value, we can reject the null hypothesis and conclude that the average salary for senior level employees is indeed higher than that of mid level employees.

I will test the following hypotheses at a significance level (alpha) of 5%:
- Null Hypothesis: The average salaries for employees in medium and large size companies are the same.
- Alternative Hypothesis: The average salary for employees from a medium size company is less than that for employees from a large size company.

In [31]:
# test the hypothesis
alpha = 0.05
results = st.ttest_ind(
    df[df.company_size=='medium']['salary_in_usd'],
    df[df.company_size=='large']['salary_in_usd'],
    alternative='less'
)
print(f'p-value: {results.pvalue}')

p-value: 0.8392032967518281


With a p-value well above our alpha value, we cannot reject the null hypothesis: the data provide no evidence that the average salary for employees from a medium size company is less than that for employees from a large size company.
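
The tests above use the default `ttest_ind`, which assumes equal variances in both groups. Because the groups here differ widely in size, Welch's variant (`equal_var=False`) may be a safer choice. The sketch below is a swapped-in alternative rather than the test the notebook ran, shown on synthetic data, with the comparison against alpha made explicit. `one_sided_welch` is a hypothetical helper.

```python
import numpy as np
import scipy.stats as st

def one_sided_welch(sample_a, sample_b, alpha: float = 0.05) -> dict:
    """Test H1: mean(sample_a) < mean(sample_b) without assuming equal variances."""
    result = st.ttest_ind(sample_a, sample_b, equal_var=False, alternative="less")
    return {"p_value": result.pvalue, "reject_null": bool(result.pvalue < alpha)}

# Synthetic salaries with different means, spreads and group sizes (not the real dataset).
rng = np.random.default_rng(42)
lower = rng.normal(loc=120_000, scale=30_000, size=500)
higher = rng.normal(loc=150_000, scale=50_000, size=200)

print(one_sided_welch(lower, higher))  # means differ in the tested direction
print(one_sided_welch(higher, lower))  # wrong direction, so the null is not rejected
```

Running the same helper on the `medium` and `large` salary groups would show whether the conclusion holds once the equal-variance assumption is dropped.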