# Data scientist salaries analysis
*By Francisco Javier Sánchez Panduro*\
*Supervised by Professor Doctor Brenda García Maya*\
*Monterrey Institute of Technology and Higher Studies*\
*August 13, 2023*

This notebook prepares the data for a linear regression model that predicts data scientist salaries in US dollars. The features are experience level, salary (in the original currency), employment type and remote ratio.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

## Data preparation

The data comes from the Data Science Job Salaries dataset on Kaggle (Bhatia, n.d.).

The file has 12 columns: an unnamed index column and the 11 columns described below.

| Column | Description |
|---|---|
| work_year | The year the salary was paid. |
| experience_level | The experience level in the job during the year: `EN` entry-level / junior, `MI` mid-level / intermediate, `SE` senior-level / expert, `EX` executive-level / director. |
| employment_type | The type of employment for the role: `PT` part-time, `FT` full-time, `CT` contract, `FL` freelance. |
| job_title | The role worked in during the year. |
| salary | The total gross salary amount paid. |
| salary_currency | The currency of the salary paid, as an ISO 4217 currency code. |
| salary_in_usd | The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com). |
| employee_residence | Employee's primary country of residence during the work year, as an ISO 3166 country code. |
| remote_ratio | The overall amount of work done remotely: `0` no remote work (less than 20%), `50` partially remote, `100` fully remote (more than 80%). |
| company_location | The country of the employer's main office or contracting branch, as an ISO 3166 country code. |
| company_size | The average number of people that worked for the company during the year: `S` fewer than 50 employees (small), `M` 50 to 250 employees (medium), `L` more than 250 employees (large). |
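Since the categorical columns use short codes, it can help to check that a column contains only the documented codes before encoding it. The following is a minimal sketch: the lookup dictionaries are transcribed from the table, while the helper `unknown_codes` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Code-to-label lookups transcribed from the column table
EXPERIENCE_LABELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}
EMPLOYMENT_LABELS = {"PT": "Part-time", "FT": "Full-time", "CT": "Contract", "FL": "Freelance"}


def unknown_codes(series, labels):
    """Return the set of values in `series` that have no entry in `labels`."""
    return set(series.unique()) - set(labels)


# Illustrative rows only (hypothetical), not read from ds_salaries.csv
sample = pd.DataFrame({
    "experience_level": ["MI", "SE", "EN", "EX"],
    "employment_type": ["FT", "FT", "CT", "XX"],  # "XX" is a deliberately invalid code
})

print(unknown_codes(sample["experience_level"], EXPERIENCE_LABELS))  # set()
print(unknown_codes(sample["employment_type"], EMPLOYMENT_LABELS))   # {'XX'}
```

An empty set means every value in the column is one of the documented codes.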

In [2]:
df = pd.read_csv('ds_salaries.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
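The `Unnamed: 0` column suggests that the CSV stores a row index in a first column with a blank header, which pandas names `Unnamed: 0` by default. As a sketch under that assumption, passing `index_col=0` to `pd.read_csv` would use that column as the index instead. The text below reproduces the five rows shown above through `io.StringIO` so the example runs without the file.

```python
import io
import pandas as pd

# First five rows as displayed above; the leading empty header field is an
# assumption about how ds_salaries.csv is laid out.
csv_text = """,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
"""

# Default read: the blank header becomes a regular column
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the row index
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns[0])  # Unnamed: 0
print(indexed.shape)     # (5, 11)
```

The notebook keeps the default read and discards the extra column later by selecting only the relevant columns, so either approach leads to the same prepared frame.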


In [3]:
print(df.shape)

(607, 12)


In [4]:
df.isnull().sum()

Unnamed: 0            0
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [5]:
# Keep only the columns relevant to the model
df = df[['experience_level', 'employment_type', 'salary_in_usd', 'salary', 'remote_ratio']].copy()
df.head()

Unnamed: 0,experience_level,employment_type,salary_in_usd,salary,remote_ratio
0,MI,FT,79833,70000,0
1,SE,FT,260000,260000,0
2,SE,FT,109024,85000,50
3,MI,FT,20000,20000,0
4,SE,FT,150000,150000,50


In [6]:
print(df['experience_level'].unique())

['MI' 'SE' 'EN' 'EX']


In [7]:
print(df['employment_type'].unique())

['FT' 'CT' 'PT' 'FL']


In [8]:
# Create dummy variables to represent categorical data in numerical form
dummies_experience_level = pd.get_dummies(df['experience_level'], prefix='experience_level', dtype = 'uint8')
dummies_experience_level.head()

Unnamed: 0,experience_level_EN,experience_level_EX,experience_level_MI,experience_level_SE
0,0,0,1,0
1,0,0,0,1
2,0,0,0,1
3,0,0,1,0
4,0,0,0,1


In [9]:
dummies_employment_type = pd.get_dummies(df['employment_type'], prefix='employment_type', dtype = 'uint8')
dummies_employment_type.head()

Unnamed: 0,employment_type_CT,employment_type_FL,employment_type_FT,employment_type_PT
0,0,0,1,0
1,0,0,1,0
2,0,0,1,0
3,0,0,1,0
4,0,0,1,0
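One point to keep in mind with this encoding: when every category gets its own dummy column, the columns of each group sum to 1 in every row, which duplicates the intercept term of a linear regression (the so-called dummy variable trap). A common remedy, shown here as a sketch and not as the notebook's own method, is `drop_first=True`, which leaves one category out as the baseline. The six-value series below is illustrative only.

```python
import pandas as pd

# Illustrative values covering all four experience codes
levels = pd.Series(["MI", "SE", "SE", "MI", "EN", "EX"], name="experience_level")

# Full encoding: one column per category
full = pd.get_dummies(levels, prefix="experience_level", dtype="uint8")

# Reduced encoding: the first category in sorted order (EN) becomes the baseline
reduced = pd.get_dummies(levels, prefix="experience_level", dtype="uint8", drop_first=True)

# Every row of the full encoding sums to 1, the same as an intercept column
print(full.sum(axis=1).unique())  # [1]
print(list(reduced.columns))
# ['experience_level_EX', 'experience_level_MI', 'experience_level_SE']
```

With the reduced encoding, each remaining coefficient reads as the salary difference relative to the baseline category.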


In [10]:
# Append the dummy columns and drop the original categorical columns
df = pd.concat([df, dummies_employment_type, dummies_experience_level], axis=1)
df = df.drop(columns=['experience_level', 'employment_type'])
df.head()

Unnamed: 0,salary_in_usd,salary,remote_ratio,employment_type_CT,employment_type_FL,employment_type_FT,employment_type_PT,experience_level_EN,experience_level_EX,experience_level_MI,experience_level_SE
0,79833,70000,0,0,0,1,0,0,0,1,0
1,260000,260000,0,0,0,1,0,0,0,0,1
2,109024,85000,50,0,0,1,0,0,0,0,1
3,20000,20000,0,0,0,1,0,0,0,1,0
4,150000,150000,50,0,0,1,0,0,0,0,1
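To illustrate how a frame of this shape feeds a linear regression, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the toy rows, the "true" coefficients and the noise level are invented, only two features are used, and `np.linalg.lstsq` stands in for the `statsmodels` and `sklearn` tools imported at the top of the notebook so that the example stays self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the prepared frame (NOT the Kaggle data)
toy = pd.DataFrame({
    "experience_level": rng.choice(["EN", "MI", "SE", "EX"], size=n),
    "remote_ratio": rng.choice([0, 50, 100], size=n),
})

# Invented effects used only to generate the toy target
true_effect = {"EN": 0.0, "MI": 30_000.0, "SE": 70_000.0, "EX": 130_000.0}
toy["salary_in_usd"] = (
    60_000.0
    + toy["experience_level"].map(true_effect)
    + 100.0 * toy["remote_ratio"]
    + rng.normal(0.0, 5_000.0, size=n)
)

# One-hot encode experience level, leaving EN out as the baseline category
X = pd.get_dummies(
    toy[["experience_level", "remote_ratio"]],
    columns=["experience_level"],
    drop_first=True,
    dtype=float,
)
X.insert(0, "intercept", 1.0)
y = toy["salary_in_usd"].to_numpy()

# Ordinary least squares: minimise ||X b - y||^2
coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
fitted = pd.Series(coef, index=X.columns)
print(fitted.round(0))
```

On this toy data the fitted coefficients should land close to the invented values used to generate it, which is a quick sanity check that the encoding and the design matrix are wired up as intended.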


## Citations
Bhatia, R. (n.d.). Data Science Job Salaries, V1.0. Retrieved August 11, 2023 from <a href="https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries">https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries</a>.

---
Francisco Javier Sánchez Panduro A01639832