# Data Staff Salary Analysis


**Introduction**

In today's data-driven landscape, understanding compensation patterns and trends is paramount to attracting and retaining top talent. The "Data Staff Salary Analysis" project delves into the intricate world of salaries within the data space. This comprehensive analysis aims to shed light on key aspects of compensation, from descriptive statistics and job roles to experience-based comparisons and regional influences.

To embark on this data-driven journey, we will begin by importing essential Python libraries that will serve as the foundation for our analysis. We will then dive into the world of descriptive statistics, examining the salary parameter to gain valuable insights.

Our exploration continues with the creation of a frequency table to visualize the distribution of salaries, followed by the construction of histograms for a clearer understanding of this distribution.

The heart of our analysis lies in deciphering the variety of jobs within the data space and identifying predominant job roles. We will provide both detailed insights and concise summaries to facilitate a comprehensive understanding of the data job landscape.

Furthermore, we will conduct in-depth comparisons, juxtaposing roles with experience levels and extracting meaningful insights. Salary comparisons based on job titles will offer a unique perspective on compensation within the data industry.

This is my first real data analysis project. I started it out of curiosity: as an aspiring data professional with decent skills in Python, pandas, matplotlib, seaborn and SQL (at the time, I wasn't sure whether I wanted to be an ML engineer, a data scientist or a data analyst), I decided to carry out this analysis to help me understand a few things, which you will discover as you read through it.

# **Importing all python libraries that would be needed all through the analysis**

In [None]:
#Importing all libraries that would be used in the analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline

In [None]:
df = pd.read_csv('/kaggle/input/employee-salaries-for-different-job-roles/ds_salaries.csv')
df.head()

# **Descriptive Statistics for the Salary Parameter**




##Creating a frequency table for the distribution
For the frequency distribution table, I chose to use 25k intervals for the salary range.


In [None]:
#Creating a frequency table for the distribution
salary_intervals = range(1, 600002, 25000) #Bin edges 1, 25001, ..., 600001: 25k-wide intervals covering every salary up to 600k
labels = [f"${i}-{i+25000-1}" for i in salary_intervals[:-1]]

# Categorize salaries into intervals and calculate frequencies
df['salary_intervals'] = pd.cut(df['salary_in_usd'], bins=salary_intervals, labels=labels, right=False)
frequency_df = df['salary_intervals'].value_counts().reset_index()
frequency_df.columns = ['Salary Range', 'Frequency']
frequency_df = frequency_df.sort_values(by='Salary Range')
frequency_df = frequency_df.reset_index(drop=True) #drop=True avoids keeping the old index as an extra column
print(frequency_df)
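
To see how the half-open bins behave at their edges, here is a minimal sketch on a handful of made-up salaries. The `toy_salaries` values are hypothetical and chosen only to sit on or near bin boundaries; the bin edges, labels and `right=False` setting are the same as in the cell above.

```python
import pandas as pd

# Hypothetical salaries chosen to sit on or near bin edges (not real data)
toy_salaries = pd.Series([2859, 25000, 25001, 100000, 600000])

# Same edges as in the analysis: 1, 25001, 50001, ..., 600001
edges = list(range(1, 600002, 25000))
labels = [f"${lo}-{lo + 25000 - 1}" for lo in edges[:-1]]

# right=False makes each bin closed on the left and open on the right: [lo, lo + 25000)
binned = pd.cut(toy_salaries, bins=edges, labels=labels, right=False)
print(binned.tolist())
```

With this setup a salary of exactly 25000 stays in the first interval and 25001 starts the second, which is why the labels read `$1-25000`, `$25001-50000`, and so on.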


## Plotting a histogram to visualize the frequency distribution


In [None]:
# Extract lower and upper salary limits for plotting
frequency_df['Lower Salary'] = frequency_df['Salary Range'].apply(lambda x: int(x.split('-')[0][1:]))
frequency_df['Upper Salary'] = frequency_df['Salary Range'].apply(lambda x: int(x.split('-')[1]))  # the upper bound has no '$' prefix, so nothing is stripped

# Create a histogram using Matplotlib
plt.figure(figsize=(10, 6))
plt.bar(frequency_df['Salary Range'], frequency_df['Frequency'], color='blue')
plt.title('Salary Distribution')
plt.xlabel('Salary Range')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##**Descriptive Statistics from frequency distribution**

From the histogram plot, we can analyze the following attributes:

1. Central Tendency: The modal salary range is 75001-100000, and the mean and median are somewhere between 75000 and 125000. We examine the exact figures below.

2. Skewness: The data exhibits a positive (right) skew. This means the outliers are on the right side (most outliers are people earning far more than the common range of salaries).

3. Spread, Variability, and Data Density: The data spread is very wide (the histogram ranges from 1 to 600,000), which points to high variability in the data. However, from the histogram, we can infer that the highest data density is in the 50000-150000 range.

4. Outliers: As the distribution has a right skew, the majority of the outliers are on the right side of the distribution (between 400000 and 600000).

5. Unimodal Distribution: The histogram has only one peak, suggesting that the data is unimodal (having a single mode).

6. Symmetry: A symmetric histogram indicates that the data is evenly distributed around the mean, while an asymmetric histogram suggests an uneven distribution. Here the histogram is asymmetric, in line with the right skew noted above.

7. Data Clusters and Patterns: By examining the shape of the histogram, you can identify data clusters or patterns that might provide insights into the underlying characteristics of the data.

8. Range of Values: From the dataset, we see that the minimum salary value is 2859 and the maximum is 600000.

It is important to note that while histograms provide useful visual insights, they may not provide all the statistical details we might need. Further statistical analysis, such as measures of central tendency (mean, median and mode), dispersion (standard deviation and variance), and other tests, is required to draw more concrete conclusions.




In [None]:
salary_stats = df['salary_in_usd'].describe()
salary_stats

A quick look at the basic descriptive statistics for the 'salary_in_usd' parameter shows a massive range in data scientist salaries across the globe. It is noteworthy that both the median and the mean are over 100,000 USD (and more than 200% of average salaries in the United States), but the standard deviation is a whopping $70,957, which is 63% of the mean, meaning there are very high levels of variability.

It's also worth noting that the mean (112297) is greater than the median (101570), which indicates a positive (right) skew. This means the outliers are on the right side (most outliers are people earning far more than the common range of salaries).
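
The two observations above can be put into numbers with a short sketch that uses only the summary figures quoted in the text. The skewness measure here is Pearson's second (median) skewness coefficient, which I am adding as a rough check; it is not something `describe()` reports.

```python
# Summary figures quoted in the text above (USD)
mean_salary = 112297
median_salary = 101570
std_salary = 70957

# Coefficient of variation: standard deviation relative to the mean
coef_variation = std_salary / mean_salary

# Pearson's second (median) skewness coefficient: 3 * (mean - median) / std
pearson_skew = 3 * (mean_salary - median_salary) / std_salary

print(f"Coefficient of variation: {coef_variation:.1%}")
print(f"Pearson median skewness: {pearson_skew:.2f}")
```

A positive coefficient agrees with the right skew read off the histogram, and the coefficient of variation reproduces the roughly 63% figure.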

The highest levels of variation, however, are in the bottom-earning quarter (Q1, up to the 25th percentile) and the top-earning quarter (Q4, above the 75th percentile). Salaries in the bottom quarter range from 2,859 to 62,726 USD per annum (the top value being nearly 22x the minimum), and salaries in the top quarter range from 150,000 to 600,000 USD annually (the top value being 4x the minimum). The middle of the pack (25th-75th percentile) sits between ~63,000 and 150,000 USD.
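
As a quick arithmetic check of the ratios quoted above, the sketch below uses only the four figures from the text (minimum, 25th percentile, 75th percentile and maximum). The variable names are my own.

```python
# Figures quoted in the text above (USD per annum)
salary_min = 2859
q1 = 62726       # 25th percentile
q3 = 150000      # 75th percentile
salary_max = 600000

# How many times larger the top of each outer quarter is than its bottom
bottom_quarter_ratio = q1 / salary_min
top_quarter_ratio = salary_max / q3

# Width of the middle 50% of salaries (interquartile range)
iqr = q3 - q1

print(f"Bottom quarter: top value is {bottom_quarter_ratio:.1f}x the minimum")
print(f"Top quarter: top value is {top_quarter_ratio:.1f}x its lower bound")
print(f"Interquartile range: {iqr} USD")
```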

Key takeaways from this include:
- With the mean and median salary values at over 2x the average personal income in the United States and 10x the global personal income, it is safe to say that data-related professions are very high-paying globally.

- There is very high salary variability across data professions globally. This could be due to a number of factors such as the staff's work experience level, employee residence, company location, company size and job title, among others, which we will look into in more depth later on.

# **Variety of Jobs in the Data Space & Predominant Job Roles Analysis**

In [None]:
#Identifying number of unique job titles in the dataset
job_titles = df['job_title'].unique()
len(job_titles)

In [None]:
 #Creating a new dataframe containing each job role and number of times they appeared in the dataset

job_title_counts = df['job_title'].value_counts()
job_title_df = pd.DataFrame({'job_title': job_title_counts.index, 'count': job_title_counts.values})
job_title_df.head()

In [None]:
#Create a bar plot to visualize the data that had been sorted above
plt.figure(figsize=(7,10))
sns.barplot(x='count', y='job_title', data = job_title_df)
plt.title('Job roles of employees')
plt.ylabel('Job Title')
plt.xlabel('Number of employees')
plt.show()


##**Variety of Jobs in the Data Space & Predominant Job Roles Summary**

It's worth noting that there are 50 different job titles recorded in a dataset with just 607 employees! This goes to show the massive range of jobs that are available in the data space.

Even with the huge diversity of roles, over 68% of the employees in our dataset work as either a Data Scientist, Data Engineer, Data Analyst or Machine Learning Engineer. These 4 job roles are by far the most common roles in the data space.
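
A share like the one above can be computed directly with `value_counts(normalize=True)`. The sketch below shows the idea on a small made-up list of titles (`toy_titles` is hypothetical and stands in for `df['job_title']`), so the printed percentage is not the dataset's.

```python
import pandas as pd

# Hypothetical job titles standing in for df['job_title'] (not real data)
toy_titles = pd.Series(
    ["Data Scientist"] * 4
    + ["Data Engineer"] * 3
    + ["Data Analyst"] * 2
    + ["Machine Learning Engineer"]
    + ["Research Scientist", "Data Architect"]
)

# normalize=True returns each title's share of all rows, sorted from most to least common
title_shares = toy_titles.value_counts(normalize=True)

# Combined share of the four most common titles
top4_share = title_shares.head(4).sum()
print(f"Top 4 titles cover {top4_share:.1%} of rows")
```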

# **Role and Experience Salary Comparison**


##**Basic Salary Comparison Based on Experience Levels**

Here, we compare the salaries of employees based on their experience levels: EN (entry-level), MI (mid-level), SE (senior-level) and EX (expert-level).

In [None]:
#Creating dataframe of salary means and experience level
experience_salary_means = df.groupby('experience_level')['salary_in_usd'].mean()
experience_salary_means

In [None]:
#Creating fresh dataframe with same info and sorting by salary
data = {
    'experience_level': ['EN', 'EX', 'MI', 'SE'],
    'salary_in_usd': [61643.318182, 199392.038462, 87996.056338, 138617.292857]
}

experience_salary_means = pd.DataFrame(data)

# Sort the DataFrame in ascending order by 'salary_in_usd'
experience_salary_means = experience_salary_means.sort_values(by='salary_in_usd', ascending=True)
experience_salary_means

In [None]:
# Creating the bar plot using Seaborn
plt.figure(figsize=(5, 3))
sns.barplot(x='experience_level', y='salary_in_usd', data=experience_salary_means)
plt.title('Average Salary by Experience Level')
plt.xlabel('Experience Level')
plt.ylabel('Average Salary (USD)')
plt.show()


In [None]:
#Create graph with smoothened curve to identify saturation points

# Create a line graph with a smooth curve
plt.figure(figsize=(10, 6))

# Interpolate the data to create a smooth curve
x = np.linspace(0, len(experience_salary_means) - 1, 400)  # Create 400 data points
spl = make_interp_spline(range(len(experience_salary_means)), experience_salary_means['salary_in_usd'], k=3)
y_smooth = spl(x)

# Plot the smoothed curve
plt.plot(x, y_smooth, color='b', label='Smoothed Curve')

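plt.title('Average Salary by Experience Level (Smoothed)')
plt.xlabel('Experience Level')
plt.ylabel('Average Salary (USD)')
# Label the x-axis positions with the experience levels instead of 0-3
plt.xticks(range(len(experience_salary_means)), experience_salary_means['experience_level'])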

# Add markers for data points
plt.scatter(range(len(experience_salary_means)), experience_salary_means['salary_in_usd'], marker='o', color='r', label='Data Points')

plt.legend()
plt.grid(False)
plt.show()

##**Salary Comparison Based on Experience Summary**

The results match expectations: entry-level positions have the lowest mean salary, followed by mid-level, then senior and expert roles.

The conclusion that can be drawn from here is that in fields of work that deal with data (and pretty much any other job field around the world), salaries go up as the level of experience goes up.

It's also interesting to note that unlike many other job roles with a salary saturation towards the top (salary increases shrinking the higher you climb), in data science it's the exact opposite!
A saturation curve has a decreasing slope as you move upwards. As seen in this curve, the data scientist salaries in this dataset have an increasing slope, showing even more significant pay raises with increasing experience levels. This can also be confirmed by looking at the sorted 'experience_salary_means' dataframe and bar plot: the difference between an experience level and the level just below it grows as we go up the ladder.
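
The growing gap between levels can be checked with a few lines. This sketch reuses the four mean salaries from the 'experience_salary_means' dataframe above and treats the levels as equally spaced steps, which is an assumption, since the dataset does not say how many years separate them.

```python
import pandas as pd

# Mean salaries per experience level from the dataframe above (USD),
# ordered from entry level to expert level
level_means = pd.Series(
    [61643.318182, 87996.056338, 138617.292857, 199392.038462],
    index=["EN", "MI", "SE", "EX"],
)

# Difference between each level and the one just below it
step_up = level_means.diff().dropna()
print(step_up.round(0))

# True when no step is smaller than the one before it
print(step_up.is_monotonic_increasing)
```

Each step up is larger than the previous one, which is the opposite of what a saturating curve would show.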


##**Salary Comparison Based on Job Title**

Here, we compare the mean salary per job title in the Data Industry. What job title earns the most? What job title earns the least? What titles are in the middle of the pack of earnings?

Soon enough we'll find out!

In [None]:
#Creating dataframe of salary means and job roles
job_role_salary_means = df.groupby('job_title')['salary_in_usd'].mean()

#Sorting from highest paying job title to least paying job title
sorted_jtdata = job_role_salary_means.sort_values(ascending=False)

#Slicing top 5 average paying job titles from the 'sorted_jtdata' Series
top_5_jt = sorted_jtdata[0:5]

#Slicing bottom 5 average paying job titles from the sorted job title ('sorted_jtdata') Series
bottom_5_jt = sorted_jtdata[-5:]  #a slice of [-6:-1] would skip the lowest-paying title

#Printing out top 5 and bottom 5 paying job titles by salary averages in the dataset
print(f'''
The top 5 data job titles by salary according to the dataset:

{top_5_jt}


The bottom 5 data job titles according to the dataset:

{bottom_5_jt}''')

##**Salary Comparison Based on Job Title Summary**

In this dataset, we see that the top 5 job titles are Data Analytics Lead, Principal Data Engineer, Financial Data Analyst, Principal Data Scientist, and Director of Data Science. Of these titles, the easiest one to break into as a complete beginner is the Financial Data Analyst role, as there are entry-level positions for this role, while all the others are very senior-level or expert-level roles.

For the bottom 5 here, I won't say too much, as I feel some of them are very good, well-paid roles that are fairly difficult to break into; the results were hugely affected by the very sparse dataset.

However, I do feel it's noteworthy that the average salary of a Product Data Analyst in this dataset is about one-twentieth (1/20) of that of a Financial Data Analyst. This is nowhere near reality or my expectations, and with so few records per title it is weak evidence, but it does suggest that Financial Data Analysts earn more than Product Data Analysts, despite huge similarities between the job roles.
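
One way to make the sparsity visible is to report the number of rows behind each mean and, optionally, set aside titles with too few rows. The sketch below uses made-up records, and `MIN_ROWS` is an arbitrary illustrative threshold rather than anything this analysis actually applied.

```python
import pandas as pd

# Hypothetical records (not real data): one title has only a single row
toy = pd.DataFrame({
    "job_title": ["Data Analyst", "Data Analyst", "Data Analyst",
                  "Data Engineer", "Data Engineer", "Niche Role"],
    "salary_in_usd": [60000, 70000, 80000, 90000, 110000, 400000],
})

# Mean salary alongside the number of rows behind each mean
title_stats = toy.groupby("job_title")["salary_in_usd"].agg(["mean", "count"])

# MIN_ROWS is an arbitrary illustrative threshold
MIN_ROWS = 2
reliable = title_stats[title_stats["count"] >= MIN_ROWS].sort_values("mean", ascending=False)
print(reliable)
```

In the toy data the single-row "Niche Role" would otherwise top the ranking on the strength of one salary, which is the same effect suspected in the real top and bottom 5.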

##**Role and Experience Salary Comparison**

After comparing salaries against job title and experience level individually, what I aim to do here is to group the data by both job title and experience level. For example, all Data Scientists would be split into the different experience levels, EN (entry-level), MI (mid-level), SE (senior-level) and EX (expert-level), and the mean salary would be computed for each job title at each experience level.

After doing this, I would then create visualizations to identify similarity and dispersion patterns.

In [None]:
df['experience_level'].unique()

In [None]:
#Role and Experience Comparison

role_experience_salarydf = df.groupby(['job_title', 'experience_level'])['salary_in_usd'].mean().reset_index()
print(role_experience_salarydf)

As all the job titles are already grouped together, the best way I thought to visualize this was to create a bar plot of the job_title against salary_in_usd that is colour coded by the experience level... Entry-level bars would be green, mid-level yellow, senior level red, and expert level black.

In [None]:
# Define color mapping based on experience_level
color_map = {
    'EN': 'green',
    'MI': 'yellow',
    'SE': 'red',
    'EX': 'black'
}

# Apply color mapping to create a 'color' column in the DataFrame
role_experience_salarydf['color'] = role_experience_salarydf['experience_level'].map(color_map)

# Create the bar plot
plt.figure(figsize=(25, 5))
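# Passing the dict as the palette ties each level to its intended colour; hue_order fixes the legend order
sns.barplot(x='job_title', y='salary_in_usd', hue='experience_level', hue_order=list(color_map), data=role_experience_salarydf, palette=color_map)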
plt.title('Job Title vs. Salary')
plt.xlabel('Job Title')
plt.ylabel('Salary (USD)')
plt.xticks(rotation=45)
plt.legend(title='Experience Level')
plt.show()


##**Role and Experience Comparison Summary**

The results obtained from the visualization are somewhat counter-intuitive. For each role, I expected the green bar to be the lowest, the yellow bar next, followed by the red and then the black bars.

Whilst the lowest salary is an entry-level role and the highest salary is an expert role, the salary dispersion between the experience levels in each job role is very uneven and nowhere near expectations. There were several outliers, with the **most notable outliers being entry and mid-level Financial Data Analysts** earning an average of 100,000 USD: more than **ALL** other entry-level, mid-level AND senior-level analysts (including senior-level staff in the same role)!!

This is unlikely to hold in general: the dataset is very sparse, and we have nowhere near enough instances to draw a logical conclusion. Still, seeing an average of 100,000 USD as an entry-level Financial Data Analyst salary inspired me, as an aspiring data analyst, to look more into financial data analysis, familiarize myself with the skills I'll need, and work on more projects to position myself to get into the financial data analysis industry.
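
Rather than judging the bar colours by eye, the expected EN < MI < SE < EX ordering can be tested per job title. The sketch below is one possible way to do it, shown on made-up numbers ("Role A" and "Role B" are hypothetical); it pivots the grouped means into one row per title and checks whether the available levels never decrease.

```python
import pandas as pd

# Hypothetical mean salaries per title and level (not real data)
toy = pd.DataFrame({
    "job_title": ["Role A"] * 3 + ["Role B"] * 3,
    "experience_level": ["EN", "MI", "SE"] * 2,
    "salary_in_usd": [50000, 70000, 90000, 100000, 80000, 120000],
})

level_order = ["EN", "MI", "SE", "EX"]

# One row per title, one column per experience level (missing combinations become NaN)
wide = toy.pivot(index="job_title", columns="experience_level", values="salary_in_usd")
wide = wide.reindex(columns=level_order)

# A title follows the expected order if its available means never decrease from EN to EX
follows_order = wide.apply(lambda row: row.dropna().is_monotonic_increasing, axis=1)
print(follows_order)
```

Applied to `role_experience_salarydf`, the same steps would list exactly which titles break the expected order.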

# **Regional Analysis**

In this section of the analysis, we would be looking at the effect of geographical location on Data job salaries.

We would be analysing average employee salaries by employee residence first, then by company locations.

Please note that all country locations in the dataset

##**Average Salary by Employee Residence Analysis**

In [None]:
#Create regional salary dataframe with salary means per location
region_salary1 = df.groupby('employee_residence')['salary_in_usd'].mean().reset_index()
# Sort region_salary1 by salaries in descending order, so we have the highest paying locations on the top
region_salary1 = region_salary1.sort_values(by='salary_in_usd', ascending=False)
region_salary1 = region_salary1.reset_index(drop=True)
#Create slices showing top 5 and bottom 5 average salaries by location
top_5 = region_salary1[0:5]
bottom_5 = region_salary1[-5:]  #a slice of [-6:-1] would leave out the lowest-paying location

top_5_mean = top_5['salary_in_usd'].mean()
bottom_5_mean = bottom_5['salary_in_usd'].mean()

print(f'The mean salary in the top 5 countries is {round(top_5_mean)} and the mean salary in the bottom 5 is {round(bottom_5_mean)}')
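
An alternative way to pick out the extremes, which avoids sorting and slicing by hand, is `nlargest` and `nsmallest`. The sketch below uses made-up country codes and means (`toy_means` is hypothetical) purely to show the calls.

```python
import pandas as pd

# Hypothetical mean salaries per country code (not real data)
toy_means = pd.Series(
    {"AA": 120000, "BB": 95000, "CC": 60000, "DD": 30000, "EE": 15000, "FF": 8000}
)

# The two highest and two lowest means, each returned in ranked order
top_2 = toy_means.nlargest(2)
bottom_2 = toy_means.nsmallest(2)

print(top_2)
print(bottom_2)
```

Note that a slice ending at `-1` on a sorted Series, such as `[-3:-1]`, stops before the final row, so it would miss "FF" here.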

In [None]:
#Data Visualization bar plot for mean salary per employee country of residence
plt.figure(figsize=(15,9))
sns.barplot(x='salary_in_usd', y='employee_residence', data=region_salary1)
plt.title('Average salary by employee residence')
plt.xlabel('Average salary in location')
plt.ylabel('Employee Residence')
plt.show()

In [None]:
#Data Visualization bar plot for top 5 employee country of residence
plt.figure(figsize=(5,3))
sns.barplot(x='salary_in_usd', y='employee_residence', data=top_5)
plt.title('Top 5 employee residence countries by average salary')
plt.xlabel('Average salary in location')
plt.ylabel('Employee Residence')
plt.show()

In [None]:
#Data Visualization bar plot for bottom 5 employee country of residence
plt.figure(figsize=(5,3))
sns.barplot(x='salary_in_usd', y='employee_residence', data=bottom_5)
plt.title('Bottom 5 employee residence countries by average salary')
plt.xlabel('Average salary in location')
plt.ylabel('Employee Residence')
plt.show()

##**Average Salary by Employee Residence Analysis Summary**

As seen in the above plots, the country of employee residence is a highly influential factor in how much the employee is paid. Employees in the data field in **first world countries** with a relatively **high cost of living**, **high currency value** and a **higher demand for data scientists**, like Malaysia, the United States (including Puerto Rico), New Zealand, Switzerland, Singapore etc., are paid much better than employees in **third world countries** in South America, Africa and Asia, where the **cost of living is much lower**, **the currency is weaker**, and **the demand for people in the field of data manipulation & analysis** is much lower.

In a different analysis document, I may do a full regression analysis on this matter(The correlation between the country of residence and annual salary).

#Remote Work Impact Analysis

Does working remotely have an impact on staff salaries? Do on-site staff get paid better than remote or hybrid staff? In this section, we answer these questions.


In [None]:
#Grouping the data by remote ratio
remote_salary = df.groupby('remote_ratio')['salary_in_usd'].mean().reset_index()
#Replacing the numeric 'remote ratio' values with plain English text so it's easier to understand
remote_salary['remote_ratio'] = remote_salary['remote_ratio'].replace({0: 'fully on-site', 50: 'hybrid', 100: 'fully remote'})
remote_salary

##Remote Work Impact on Salary Analysis Summary

From the 'remote_salary' dataframe we can see that, on average, fully remote staff are paid better than their hybrid or on-site counterparts. On-site staff are paid better than hybrid staff but not remote staff, and hybrid staff are paid the least on average.
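
Because the salary distribution is right-skewed, a few very high salaries can pull a group mean upwards. A sketch of a more cautious comparison is to put the median and the group size next to the mean; the numbers below are made up (`toy` is hypothetical) and only illustrate the calls.

```python
import pandas as pd

# Hypothetical records (not real data); the last remote salary is an outlier
toy = pd.DataFrame({
    "remote_ratio": [0, 0, 0, 50, 50, 100, 100, 100],
    "salary_in_usd": [80000, 90000, 100000, 60000, 70000, 90000, 100000, 410000],
})

# Mean, median and row count per remote ratio in one table
summary = toy.groupby("remote_ratio")["salary_in_usd"].agg(["mean", "median", "count"])

# Replace the numeric codes with readable labels
summary.index = summary.index.map({0: "fully on-site", 50: "hybrid", 100: "fully remote"})
print(summary)
```

In the toy data the fully remote mean is twice its median, which shows how a single large salary can change the ranking of group means.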

This may also be due

##Popularity of remote work Analysis

Which of the job types was most popular in the year 2020 when the data was taken? Given that this was the same year we had the first and second lockdowns due to the COVID-19 pandemic, I expect that the majority of the staff here would be fully remote or hybrid and a much smaller number would be fully on-site.

Unfortunately, there is no data from previous years to do a time-series analysis on this, but we'll do a shallow analysis regardless and see if our suspicions are correct.

In [None]:
#Counting the number of staff per remote ratio from the original dataframe
remote_count = df.groupby('remote_ratio')['salary_in_usd'].count().reset_index()
#Renaming the rows and columns appropriately so it's easy to read and understand
remote_count.rename(columns ={'salary_in_usd':'staff_count'}, inplace = True)
remote_count['remote_ratio'] = remote_count['remote_ratio'].replace({0: 'fully on-site', 50: 'hybrid', 100: 'fully remote'})
remote_count

In [None]:
#Visualizing the remote_count dataframe in a pie chart
plt.figure(figsize=(8, 8))
plt.pie(remote_count['staff_count'], labels=remote_count['remote_ratio'], autopct='%1.1f%%', startangle=140, colors=['lightcoral', 'lightskyblue', 'lightgreen'])
plt.title('Remote Work Distribution')
plt.axis('equal')
plt.show()

##Popularity of remote work summary

Our expectations were met, as we saw that 62.8% (more than 3 in every 5) of data staff worked fully remotely in 2020. As highlighted earlier, the major cause of this was very likely the multiple lockdowns and the massive spread of the COVID-19 pandemic that year. Surprisingly, 20.9% of staff (a tiny bit more than 1 in every 5) stayed fully on-site through the year, and the hybrid work style (which was expected to be next in popularity after remote work) was the least popular, with only 16.3% (a little less than 1 in every 6) of staff working that way.

In conclusion, remote working (including fully remote and hybrid working) was very popular in 2020, with 79.1% of data staff (about 4 in every 5) working either fully or partially remotely and just 20.9% (about 1 in 5) working fully on-site. Remote working was not in any way popular before the COVID-19 pandemic, when fully on-site was by far the most popular working style (I would have liked to have data from previous years here so I could do a time-series analysis), so it's safe to say the COVID-19 pandemic has had a major impact on how we work, the effects of which are still being felt today.
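
As a consistency check on the percentages in this summary: the three work styles are mutually exclusive, so their shares should add up to 100%. The sketch below starts from the on-site and hybrid shares quoted above and derives the other two figures; the variable names are my own.

```python
# Shares quoted in the summary above, in percent
on_site_share = 20.9
hybrid_share = 16.3

# The three work styles are mutually exclusive, so the remaining share is fully remote
fully_remote_share = round(100 - on_site_share - hybrid_share, 1)

# Anyone not fully on-site works remotely at least part of the time
any_remote_share = round(100 - on_site_share, 1)

print(f"Fully remote: {fully_remote_share}%")
print(f"Fully or partially remote: {any_remote_share}%")
```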


#**Conclusion**

In the realm of data-driven industries, understanding salary dynamics is pivotal for organizations and individuals alike. The "Data Staff Salary Analysis" project has embarked on an illuminating journey through the intricacies of compensation within the data space, unearthing valuable insights and trends that can shape the future of talent management and career decisions.

Throughout our analysis, we have harnessed the power of descriptive statistics to comprehend the nuances of salary distributions. From frequency tables to histograms, we visualized these distributions, revealing the diversity in compensation ranges that exist within the data field.

Our exploration of job roles provided a comprehensive view of the data job landscape. By identifying predominant roles and examining their compensation, we have not only broadened our understanding of job variety but also unveiled critical information that can inform career choices and talent acquisition strategies.

Delving deeper, our experience-based salary comparisons have highlighted the relationship between experience level and compensation. These insights are invaluable for both individuals looking to negotiate their salaries and organizations aiming to benchmark their compensation packages.

The analysis of salaries based on job titles has brought forth a nuanced perspective on how different roles are rewarded within the data sector. This information can guide both job seekers and employers in making informed decisions.

Regional influences on salaries have been examined, underlining the significance of geography in compensation. These findings have the potential to reshape hiring strategies and salary structures to reflect regional disparities.

Finally, our examination of remote work has shown the popularity of this mode of employment within the data industry. This not only offers opportunities for greater work-life balance but also presents fresh challenges and considerations in salary negotiation and management.

In essence, the "Data Staff Salary Analysis" project has offered a multifaceted view of compensation within the data field. It is our hope that the insights uncovered here will empower individuals to make informed career choices and organizations to optimize their talent strategies. By understanding the dynamics of salaries in this ever-evolving landscape, we pave the way for a more equitable, competitive, and prosperous future in the data industry.

As data continues to shape our world, let these findings be a guiding light for those navigating their path within this exciting and dynamic field.