<a href="https://www.kaggle.com/code/prasadposture121/analysis-of-data-science-jobs?scriptVersionId=131249160" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Analysis of Data Science Jobs

# Introduction
The amount of data we generate has increased exponentially over the past few years. As John Naisbitt once said, "We are drowning in information but starving for knowledge." To address this, the field of data science has developed rapidly. It draws mainly on statistics, mathematics, scientific methods, computer programming, machine learning and deep learning to extract knowledge, generate insights and derive patterns from noisy, unstructured data. The growth of the field has created many job opportunities across the world, and data science has become one of the most sought-after career options today. The field will likely continue to grow, since the need to analyze massive amounts of data increases day by day. There are various job titles within this field, and to understand them I use a dataset of 3,755 job records from companies across the world. I will perform a question-based analysis to generate insights and to see which factors most affect the relationship between job titles and salaries. If you find this notebook useful, please upvote, and if you have any suggestions or queries, please feel free to contact me. Thank you.

# Importing Dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import plotly.express as px

# Loading the Data

In [2]:
df = pd.read_csv('/kaggle/input/data-science-salaries-2023/ds_salaries.csv')
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


In [3]:
# Shape of the dataset i.e. No of rows and columns
df.shape

(3755, 11)

In [4]:
# General Information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [5]:
# Statistical Information of the Numeric Attributes
df.describe()

,work_year,salary,salary_in_usd,remote_ratio
count,3755.0,3755.0,3755.0,3755.0
mean,2022.373635,190695.6,137570.38988,46.271638
std,0.691448,671676.5,63055.625278,48.58905
min,2020.0,6000.0,5132.0,0.0
25%,2022.0,100000.0,95000.0,0.0
50%,2022.0,138000.0,135000.0,0.0
75%,2023.0,180000.0,175000.0,100.0
max,2023.0,30400000.0,450000.0,100.0


In [6]:
# Statistical Information of the Categorical Attributes
df.describe(include=['O'])

,experience_level,employment_type,job_title,salary_currency,employee_residence,company_location,company_size
count,3755,3755,3755,3755,3755,3755,3755
unique,4,4,93,20,78,72,3
top,SE,FT,Data Engineer,USD,US,US,M
freq,2516,3718,1040,3224,3004,3040,3153


In [7]:
# Checking for the missing values
df.isna().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

# Data Manipulation and Feature Engineering

### Converting Country Codes (ISO3)

In [8]:
! pip install country_converter

Collecting country_converter
  Downloading country_converter-1.0.0-py3-none-any.whl (44 kB)
Installing collected packages: country_converter
Successfully installed country_converter-1.0.0

In [9]:
# Converting the two-letter country codes in employee_residence to ISO3
import country_converter as coco
df['employee_residence'] = coco.convert(names=df['employee_residence'], to='ISO3')
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ESP,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,USA,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,USA,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CAN,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CAN,100,CA,M


In [10]:
# Converting the two-letter country codes in company_location to ISO3
# (country_converter is already imported as coco in the previous cell)
df['company_location'] = coco.convert(names=df['company_location'], to='ISO3')
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ESP,100,ESP,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,USA,100,USA,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,USA,100,USA,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CAN,100,CAN,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CAN,100,CAN,M


### Foreign Employees
We add a new column that records whether the employee's country of residence matches the country where the company is based. If the two match, the employee is not a foreign employee; otherwise, the employee is a foreign employee.

In [11]:
df['foreign_employee']=np.where(df['employee_residence']==df['company_location'],"No","Yes")
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,foreign_employee
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ESP,100,ESP,L,No
1,2023,MI,CT,ML Engineer,30000,USD,30000,USA,100,USA,S,No
2,2023,MI,CT,ML Engineer,25500,USD,25500,USA,100,USA,S,No
3,2023,SE,FT,Data Scientist,175000,USD,175000,CAN,100,CAN,M,No
4,2023,SE,FT,Data Scientist,120000,USD,120000,CAN,100,CAN,M,No
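
To see how common foreign employees are, the new column can be summarised with `value_counts`. The sketch below applies the same `np.where` rule to a small hypothetical frame (the rows in `toy` are made up for illustration and are not taken from the dataset), then computes the share of each class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not from the dataset), using ISO3 codes as in the notebook.
toy = pd.DataFrame({
    "employee_residence": ["ESP", "USA", "IND", "CAN"],
    "company_location":   ["ESP", "USA", "USA", "DEU"],
})

# Same rule as above: "No" when both countries match, "Yes" otherwise.
toy["foreign_employee"] = np.where(
    toy["employee_residence"] == toy["company_location"], "No", "Yes"
)

# Fraction of rows in each class.
foreign_share = toy["foreign_employee"].value_counts(normalize=True)
print(foreign_share)
```

On the real data, the same `value_counts(normalize=True)` call on `df['foreign_employee']` would give the corresponding shares.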


### Currency Conversion Rates
The salaries are given in different currencies, and another column contains the same salaries converted to USD. We can use these two columns to determine the conversion rate of each currency, with USD as the standard currency. Dividing `salary` by `salary_in_usd` gives the number of local currency units per US dollar, so rows paid in USD have a rate of 1.0.

In [12]:
df['conversion_rates'] = df['salary']/df['salary_in_usd']
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,foreign_employee,conversion_rates
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ESP,100,ESP,L,No,0.93189
1,2023,MI,CT,ML Engineer,30000,USD,30000,USA,100,USA,S,No,1.0
2,2023,MI,CT,ML Engineer,25500,USD,25500,USA,100,USA,S,No,1.0
3,2023,SE,FT,Data Scientist,175000,USD,175000,CAN,100,CAN,M,No,1.0
4,2023,SE,FT,Data Scientist,120000,USD,120000,CAN,100,CAN,M,No,1.0
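
The row-level rates can be collapsed into one rate per currency. The sketch below rebuilds the five rows shown above in a small frame named `sample` and takes the median rate per `salary_currency`. Using the median is an assumption made here for illustration, since rates for the same currency may differ between rows and work years.

```python
import pandas as pd

# The first five rows of the dataset, as displayed in the output above.
sample = pd.DataFrame({
    "salary":          [80000, 30000, 25500, 175000, 120000],
    "salary_currency": ["EUR", "USD", "USD", "USD", "USD"],
    "salary_in_usd":   [85847, 30000, 25500, 175000, 120000],
})

# Local currency units per US dollar, row by row.
sample["conversion_rates"] = sample["salary"] / sample["salary_in_usd"]

# One summary rate per currency (median chosen as an assumption).
rates = sample.groupby("salary_currency")["conversion_rates"].median()
print(rates)
```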


### Experience Level
The experience level is stored as short codes. We expand them into readable labels using the `replace` method.

In [13]:
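# Assign the result back rather than calling replace(..., inplace=True) on a
# column selection, which recent pandas versions warn about.
df['experience_level'] = df['experience_level'].replace(
    {'SE': 'Senior-level', 'MI': 'Mid-level', 'EN': 'Entry-level', 'EX': 'Executive-level'})
df.head()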

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,foreign_employee,conversion_rates
0,2023,Senior-level,FT,Principal Data Scientist,80000,EUR,85847,ESP,100,ESP,L,No,0.93189
1,2023,Mid-level,CT,ML Engineer,30000,USD,30000,USA,100,USA,S,No,1.0
2,2023,Mid-level,CT,ML Engineer,25500,USD,25500,USA,100,USA,S,No,1.0
3,2023,Senior-level,FT,Data Scientist,175000,USD,175000,CAN,100,CAN,M,No,1.0
4,2023,Senior-level,FT,Data Scientist,120000,USD,120000,CAN,100,CAN,M,No,1.0


### Employment Type
The employment type is also stored as short codes, so we expand it in the same way.

In [14]:
# Assign the result back instead of using inplace=True on a column selection
df['employment_type'] = df['employment_type'].replace(
    {'FT': 'Full-time', 'CT': 'Contract', 'FL': 'Freelancer', 'PT': 'Part-time'})
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size,foreign_employee,conversion_rates
0,2023,Senior-level,Full-time,Principal Data Scientist,80000,EUR,85847,ESP,100,ESP,L,No,0.93189
1,2023,Mid-level,Contract,ML Engineer,30000,USD,30000,USA,100,USA,S,No,1.0
2,2023,Mid-level,Contract,ML Engineer,25500,USD,25500,USA,100,USA,S,No,1.0
3,2023,Senior-level,Full-time,Data Scientist,175000,USD,175000,CAN,100,CAN,M,No,1.0
4,2023,Senior-level,Full-time,Data Scientist,120000,USD,120000,CAN,100,CAN,M,No,1.0


### Job Type
Using `remote_ratio`, we replace the numeric values with the labels Remote (100%), Hybrid (50%) and On-Site (0%), and rename the column to `job_type`.

In [15]:
df['remote_ratio'] = df['remote_ratio'].replace({0: 'On-Site', 50: 'Hybrid', 100: 'Remote'})
df = df.rename(columns={'remote_ratio': 'job_type'})  # rename the column to match its new labels
df.head()

,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,job_type,company_location,company_size,foreign_employee,conversion_rates
0,2023,Senior-level,Full-time,Principal Data Scientist,80000,EUR,85847,ESP,Remote,ESP,L,No,0.93189
1,2023,Mid-level,Contract,ML Engineer,30000,USD,30000,USA,Remote,USA,S,No,1.0
2,2023,Mid-level,Contract,ML Engineer,25500,USD,25500,USA,Remote,USA,S,No,1.0
3,2023,Senior-level,Full-time,Data Scientist,175000,USD,175000,CAN,Remote,CAN,M,No,1.0
4,2023,Senior-level,Full-time,Data Scientist,120000,USD,120000,CAN,Remote,CAN,M,No,1.0


# Question Based Analysis

<b> 1) Which are the top 10 most popular job designations? </b>

In [16]:
df['job_title'].value_counts().head(10)

Data Engineer                1040
Data Scientist                840
Data Analyst                  612
Machine Learning Engineer     289
Analytics Engineer            103
Data Architect                101
Research Scientist             82
Data Science Manager           58
Applied Scientist              58
Research Engineer              37
Name: job_title, dtype: int64

In [17]:
# Compute the top 10 counts once and reuse them for both axes
top_titles = df['job_title'].value_counts().head(10)
px.bar(x=top_titles.index, y=top_titles.values,
       title='Top 10 Most Popular Job Designations',
       labels={'y':'No. of posts','x':'Job Designations'})

Data Engineer is the most popular job designation (1,040 records), followed by Data Scientist (840) and Data Analyst (612).
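
To put these counts in proportion, they can be expressed as a percentage of all rows. The sketch below uses the three counts and the row total (3,755) reported in the outputs above. The names `counts`, `share` and `top3_share` are introduced only for this example.

```python
import pandas as pd

# Counts for the three most frequent titles, copied from the value_counts output above.
counts = pd.Series({
    "Data Engineer": 1040,
    "Data Scientist": 840,
    "Data Analyst": 612,
})
total_rows = 3755  # number of rows reported by df.shape

# Percentage of all rows held by each title, rounded to one decimal place.
share = (counts / total_rows * 100).round(1)

# Combined percentage of the three titles.
top3_share = round(counts.sum() / total_rows * 100, 1)
print(share)
print(top3_share)
```

On the full DataFrame, `df['job_title'].value_counts(normalize=True)` gives the same proportions directly.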

<b> 2) Which are the top 10 highest paid job designations over the years? </b>

In [18]:
xdf=df.groupby(['job_title'])['salary_in_usd'].median().sort_values(ascending=False).head(10)
xdf

job_title
Data Science Tech Lead                 375000.0
Cloud Data Architect                   250000.0
Data Lead                              212500.0
Data Analytics Lead                    211254.5
Head of Data                           202500.0
Principal Data Engineer                192500.0
Applied Scientist                      191737.5
Principal Machine Learning Engineer    190000.0
Data Science Manager                   183780.0
Data Infrastructure Engineer           183655.0
Name: salary_in_usd, dtype: float64

In [19]:
px.bar(x=xdf.index, y=xdf, title='Top 10 High Paying Job Designations',
      labels={'y':'Median Salary','x':'Job Designations'})
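
A median computed from only one or two records can dominate a ranking like this one, so it may help to look at the number of records behind each median. The sketch below uses a hypothetical toy frame (the titles and salaries in `toy` are made up) and a hypothetical `min_records` threshold. It reports the median together with the record count and then keeps only titles with enough records.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "job_title": ["Tech Lead", "Data Engineer", "Data Engineer",
                  "Data Engineer", "Data Analyst", "Data Analyst"],
    "salary_in_usd": [375000, 120000, 140000, 160000, 90000, 110000],
})

# Median salary and number of records per title, highest median first.
summary = (
    toy.groupby("job_title")["salary_in_usd"]
       .agg(median_salary="median", n_records="count")
       .sort_values("median_salary", ascending=False)
)

# Keep only titles backed by at least `min_records` rows (threshold is an assumption).
min_records = 2
reliable = summary[summary["n_records"] >= min_records]
print(summary)
print(reliable)
```

Applying the same aggregation to `df` would show how many records support each of the top 10 medians above.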

#### Work in Progress