<img src = "https://www.abgconsultoria.com.br/blog/wp-content/uploads/171127-Data-Science.jpg"> <br>
# <center>🧑‍💻 Data Science Job Salaries | EDA 📊 📈 </center>

# Introduction <br>
This notebook is a study of the <a href="https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries">Data Science Job Salaries</a> dataset, which contains information on salary, company size and location, remote work, employee residence, and other attributes of people working in the Data Science field around the world.<br><br>
The goal of this notebook is to run an EDA on the data, to understand how the feature *salary_in_usd* relates to the other features and how relevant each one is to achieving higher yearly earnings.<br><br>
After the EDA, I'll use the **PyCaret** regression library to predict employee salaries in USD from the features analyzed in the dataset.

In [245]:
# Installing PyCaret
#!pip install --ignore-installed pycaret --user

In [246]:
# Importing Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from pycaret.regression import *

In [247]:
# Loading data 
df = pd.read_csv('../input/data-science-job-salaries/ds_salaries.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [248]:
# Dropping the 'Unnamed: 0' column
df.drop('Unnamed: 0', axis = 1, inplace=True)
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In [249]:
# Printing information on the dataset
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns\n')
print('- -' * 30)
print('Value counts for each columns: \n')
for i in df.columns:
    print(f'===== {i} =====\n')
    print(df[i].value_counts().sort_values(ascending=False))
    print('- -' * 30)

The dataset has 607 rows and 11 columns

- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
Value counts for each columns: 

===== work_year =====

2022    318
2021    217
2020     72
Name: work_year, dtype: int64
- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
===== experience_level =====

SE    280
MI    213
EN     88
EX     26
Name: experience_level, dtype: int64
- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
===== employment_type =====

FT    588
PT     10
CT      5
FL      4
Name: employment_type, dtype: int64
- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -
===== job_title =====

Data Scientist                              143
Data Engineer                               132
Data Analyst                                 97
Machine Learning Engineer                    41
Research Scientist                           1

In [250]:
# Renaming attributes 
df['experience_level'] = df['experience_level'].map({'SE':'Senior', 'MI':'Intermediate','EN':'Junior', 'EX':'Executive - Director'})
df['employment_type'] = df['employment_type'].map({'PT':'Part-time','FT': 'Full-time', 'CT': 'Contract','FL':'Freelance'})
df['company_size'] = df['company_size'].map({'M': 'Medium', 'L': 'Large', 'S':'Small'})
# Visualize DF
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,Intermediate,Full-time,Data Scientist,70000,EUR,79833,DE,0,DE,Large
1,2020,Senior,Full-time,Machine Learning Scientist,260000,USD,260000,JP,0,JP,Small
2,2020,Senior,Full-time,Big Data Engineer,85000,GBP,109024,GB,50,GB,Medium
3,2020,Intermediate,Full-time,Product Data Analyst,20000,USD,20000,HN,0,HN,Small
4,2020,Senior,Full-time,Machine Learning Engineer,150000,USD,150000,US,50,US,Large


In [251]:
# Checking employees paid in Brazilian reais (BRL)
df.query("salary_currency == 'BRL'")

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
205,2021,Intermediate,Full-time,Data Scientist,69600,BRL,12901,BR,0,BR,Small
271,2021,Senior,Full-time,Computer Vision Engineer,102000,BRL,18907,BR,0,BR,Medium


In [252]:
# Checking for null values
df.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

In [253]:
# Checking for '0' in salaries
df.query("salary_in_usd == 0")

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size


There are **no missing values**, and no salary_in_usd value is recorded as *0*. <br><br>
Since every employee has a salary in USD regardless of the local currency they are paid in, I'll drop both the *salary* and *salary_currency* columns.

In [254]:
# Dropping columns 
df.drop(['salary','salary_currency'], axis = 1, inplace = True)
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,Intermediate,Full-time,Data Scientist,79833,DE,0,DE,Large
1,2020,Senior,Full-time,Machine Learning Scientist,260000,JP,0,JP,Small
2,2020,Senior,Full-time,Big Data Engineer,109024,GB,50,GB,Medium
3,2020,Intermediate,Full-time,Product Data Analyst,20000,HN,0,HN,Small
4,2020,Senior,Full-time,Machine Learning Engineer,150000,US,50,US,Large


In [255]:
# Boxplot for salaries according to experience level
fig = px.box(df,x = 'salary_in_usd',color = 'experience_level',y = 'experience_level',template='seaborn',
            title = 'Yearly salaries in USD per experience level')
fig.show()

Those at the Executive - Director level clearly have the highest earnings, followed by those at the Senior level, while those at the Junior level have the lowest earnings of all. <br>
There seem to be some outliers at the Intermediate level, with earnings above the highest value seen at the Senior level.
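To make the idea of an "outlier" in a box plot concrete, here is a minimal sketch of the usual 1.5 × IQR rule applied per experience level. The salaries are toy values invented for illustration, not rows from the dataset, and `iqr_outliers` is a hypothetical helper. Plotly's own quartile computation may differ slightly from pandas'.

```python
import pandas as pd

# Toy salaries, NOT taken from the dataset; chosen only to illustrate the rule.
toy = pd.DataFrame({
    "experience_level": ["Intermediate"] * 6 + ["Senior"] * 6,
    "salary_in_usd": [40_000, 55_000, 60_000, 70_000, 80_000, 450_000,
                      90_000, 110_000, 120_000, 140_000, 150_000, 160_000],
})

def iqr_outliers(group: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of `group` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = group.quantile(0.25), group.quantile(0.75)
    iqr = q3 - q1
    return group[(group < q1 - k * iqr) | (group > q3 + k * iqr)]

# One list of flagged salaries per experience level
outliers = {
    level: iqr_outliers(salaries).tolist()
    for level, salaries in toy.groupby("experience_level")["salary_in_usd"]
}
print(outliers)
```

In this toy example only the 450,000 Intermediate salary is flagged, even though it is higher than every Senior salary.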

In [256]:
# Distribution of experience levels in the dataset
fig = px.pie(df, names = 'experience_level', template = 'seaborn')
fig.show()

Most employees in the dataset are at either the Senior level (280 of 607) or the Intermediate level (213 of 607).
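The shares shown in the pie chart can be recomputed from the `value_counts()` output printed earlier. This small sketch uses only those counts, with the labels already renamed:

```python
# Counts copied from the value_counts() output for experience_level
counts = {"Senior": 280, "Intermediate": 213, "Junior": 88, "Executive - Director": 26}
total = sum(counts.values())  # number of rows in the dataset

# Percentage of rows at each experience level, rounded to one decimal
shares = {level: round(100 * n / total, 1) for level, n in counts.items()}
print(total)
print(shares)
```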

In [257]:
# Where are the employees with the highest earnings located?
# Only the numeric column is selected; the grouping key becomes the index
avg_earnings_per_country = df.groupby('employee_residence')[['salary_in_usd']].mean().sort_values('salary_in_usd', ascending=False).round(2).head(10)
fig = px.bar(avg_earnings_per_country, x = avg_earnings_per_country.index, y = 'salary_in_usd',
            title = 'Average earnings according to employee residence', template = 'seaborn', color = avg_earnings_per_country.index, text = 'salary_in_usd')
fig.show()
avg_earnings_per_country

Unnamed: 0_level_0,salary_in_usd
employee_residence,Unnamed: 1_level_1
MY,200000.0
PR,160000.0
US,149194.12
NZ,125000.0
CH,122346.0
AU,108042.67
RU,105750.0
SG,104176.5
JP,103537.71
AE,100000.0


Reading the ISO 3166 country codes, employees residing in Malaysia, Puerto Rico, the United States, New Zealand, Switzerland, Australia, Russia, Singapore, Japan and the United Arab Emirates have the highest average earnings.
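An average over a handful of rows can be dominated by a single salary, and the number of employees per country is not shown in the table above. A minimal sketch of how the count could be reported next to the mean follows. The rows are toy values, the country codes `AA` and `BB` are placeholders, and `min_rows` is a hypothetical threshold.

```python
import pandas as pd

# Toy rows, NOT from the dataset: one country with a single high salary,
# another with several moderate salaries.
toy = pd.DataFrame({
    "employee_residence": ["AA", "BB", "BB", "BB", "BB"],
    "salary_in_usd": [200_000, 150_000, 140_000, 160_000, 150_000],
})

# Mean salary and number of rows per country, highest mean first
summary = (
    toy.groupby("employee_residence")["salary_in_usd"]
       .agg(mean="mean", count="count")
       .sort_values("mean", ascending=False)
)

# Hypothetical threshold: keep only countries with at least `min_rows` employees
min_rows = 3
reliable = summary[summary["count"] >= min_rows]

print(summary)
print(reliable)
```

Here `AA` tops the ranking on the strength of one row, and drops out once the threshold is applied.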

In [258]:
# Among those with higher earnings, where are the companies they work for located?
# Only the numeric column is selected; the grouping key becomes the index
avg_payments_per_country = df.groupby('company_location')[['salary_in_usd']].mean().sort_values('salary_in_usd', ascending=False).round(2).head(10)
fig = px.bar(avg_payments_per_country, x = avg_payments_per_country.index, y = 'salary_in_usd',
            title = 'Average earnings according to company location', template = 'seaborn', color = avg_payments_per_country.index, text = 'salary_in_usd')
fig.show()
avg_payments_per_country

Unnamed: 0_level_0,salary_in_usd
company_location,Unnamed: 1_level_1
RU,157500.0
US,144055.26
NZ,125000.0
IL,119059.0
JP,114127.33
AU,108042.67
AE,100000.0
DZ,100000.0
IQ,100000.0
CA,99823.73


People who work for companies located in Russia, the United States, New Zealand, Israel, Japan, Australia, the United Arab Emirates, Algeria, Iraq and Canada have the highest average earnings.

In [259]:
# What employment type has higher earnings
fig = px.box(df, y = 'salary_in_usd', x = 'employment_type', template = 'seaborn', color = 'employment_type',
              title = 'Salaries according to employment type')
fig.show()
df.groupby('employment_type').salary_in_usd.describe().round(2)

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
employment_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Contract,5.0,184575.0,156250.89,31875.0,100000.0,105000.0,270000.0,416000.0
Freelance,4.0,48000.0,40529.82,12000.0,18000.0,40000.0,70000.0,100000.0
Full-time,588.0,113468.07,69476.47,2859.0,64962.25,104196.5,150000.0,600000.0
Part-time,10.0,33070.5,31472.91,5409.0,12000.0,18817.5,48370.0,100000.0


There are far more full-time employees than any other type. However, those working on contracts seem to have very high earnings, with the minimum contract salary (31,875) being much higher than the minimum full-time salary (2,859), though the contract group has only 5 rows.

In [260]:
# Checking salaries for those working on a contract employment type
df.query("employment_type == 'Contract'").sort_values('salary_in_usd', ascending = True)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
489,2022,Junior,Contract,Applied Machine Learning Scientist,31875,TN,100,CZ,Medium
28,2020,Junior,Contract,Business Data Analyst,100000,US,100,US,Large
283,2021,Senior,Contract,Staff Data Scientist,105000,US,100,US,Medium
78,2021,Intermediate,Contract,ML Engineer,270000,US,100,US,Large
225,2021,Executive - Director,Contract,Principal Data Scientist,416000,US,100,US,Small
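Since all five Contract rows are listed above, the summary statistics from `describe()` can be reproduced by hand. This sketch uses only the standard library and also shows how far the mean sits above the median in such a small group:

```python
from statistics import mean, median, stdev

# salary_in_usd for the five Contract rows listed in the table above
contract = [31_875, 100_000, 105_000, 270_000, 416_000]

contract_mean = round(mean(contract), 2)
contract_median = median(contract)
contract_std = round(stdev(contract), 2)  # sample standard deviation (n - 1)

print(contract_mean, contract_median, contract_std)
```

The mean (184,575) is pulled well above the median (105,000) by the two highest salaries.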


In [261]:
# Checking salaries for those working on a full-time employment type
df.query("employment_type == 'Full-time'").sort_values('salary_in_usd', ascending = True).head(5)

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
176,2021,Intermediate,Full-time,Data Scientist,2859,MX,0,MX,Small
185,2021,Intermediate,Full-time,Data Engineer,4000,IR,100,IR,Medium
238,2021,Junior,Full-time,Data Scientist,4000,VN,0,VN,Medium
179,2021,Intermediate,Full-time,Data Scientist,5679,IN,100,US,Small
18,2020,Junior,Full-time,Data Science Consultant,5707,IN,50,IN,Medium


In [262]:
# How are salaries distributed among each remote_ratio
fig = px.box(df, x = 'remote_ratio', y = 'salary_in_usd', template = 'seaborn', color = 'remote_ratio',
            title = 'Salary in USD according to remote_ratio')
fig.show()
remote_ratio_average = df.groupby('remote_ratio')[['remote_ratio','salary_in_usd']].mean().sort_values('salary_in_usd', ascending=False)
remote_ratio_average.round(0)

Unnamed: 0_level_0,remote_ratio,salary_in_usd
remote_ratio,Unnamed: 1_level_1,Unnamed: 2_level_1
100,100.0,122457.0
0,0.0,106355.0
50,50.0,80823.0


Considering that: <br><br>
- 0 = Not remote <br>
- 50 = Partially remote<br>
- 100 = Fully remote<br><br>

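Fully remote employees have the highest average salary (about 122,457 USD), followed by non-remote employees (106,355 USD) and partially remote employees (80,823 USD).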

In [263]:
# Among those with higher earnings, what is the size of the companies they work for?
# Only the numeric column is selected; the grouping key becomes the index
company_size = df.groupby('company_size')[['salary_in_usd']].mean().sort_values('salary_in_usd', ascending=False).round(2).head(10)
fig = px.bar(company_size, x = company_size.index, y = 'salary_in_usd',
            title = 'Average earnings according to company size', template = 'seaborn', color = company_size.index, text = 'salary_in_usd')
fig.show()
company_size

Unnamed: 0_level_0,salary_in_usd
company_size,Unnamed: 1_level_1
Large,119242.99
Medium,116905.47
Small,77632.67


Large and Medium companies both pay much more on average (about 119,243 and 116,905 USD) than Small companies (about 77,633 USD).

In [264]:
# What are the titles with the highest earnings on average?
# Only the numeric column is selected; the grouping key becomes the index
avg_earnings_per_title = df.groupby('job_title')[['salary_in_usd']].mean().sort_values('salary_in_usd', ascending=False).round(2).head(10)
fig = px.bar(avg_earnings_per_title, x = avg_earnings_per_title.index, y = 'salary_in_usd',
            title = 'Average earnings according to job title', template = 'seaborn', color = avg_earnings_per_title.index, text = 'salary_in_usd')
fig.show()
avg_earnings_per_title

Unnamed: 0_level_0,salary_in_usd
job_title,Unnamed: 1_level_1
Data Analytics Lead,405000.0
Principal Data Engineer,328333.33
Financial Data Analyst,275000.0
Principal Data Scientist,215242.43
Director of Data Science,195074.0
Data Architect,177873.91
Applied Data Scientist,175655.0
Analytics Engineer,175000.0
Data Specialist,165000.0
Head of Data,160162.6


In [265]:
# Generating full report on data
import pandas_profiling as pp
pp.ProfileReport(df)




In [266]:
# How is the target variable distributed?
hist_data = [df['salary_in_usd']]
group_labels = ['salary_in_usd']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False)
fig.update_layout(title = 'Distribution of target variable: salary in USD')
fig.layout.template = 'seaborn'
fig.show()

Salaries in USD do not follow a normal (Gaussian) distribution.<br>
We'll address that later on, when setting up PyCaret.
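One way to put a number on "not normal" is skewness, which is 0 for a symmetric sample and positive when there is a long right tail. The sketch below uses toy salaries (not dataset rows) and a hypothetical `skewness` helper to show how a log transform reduces it. The log is only a simple stand-in here; it is the λ = 0 special case of the Box-Cox transform.

```python
import numpy as np

# Toy right-skewed salaries, NOT from the dataset
salaries = np.array([20_000, 40_000, 60_000, 80_000,
                     100_000, 120_000, 150_000, 600_000], dtype=float)

def skewness(x: np.ndarray) -> float:
    """Population skewness: the mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

raw_skew = skewness(salaries)          # long right tail -> clearly positive
log_skew = skewness(np.log(salaries))  # tail compressed -> closer to 0

print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```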

In [267]:
# Sizing a hold-out set (20% of the rows) to test the model on later
print(f'rows in the dataset: {df.shape[0]}')
print(f'20% = {df.shape[0]* 0.2}')

rows in the dataset: 607
20% = 121.4


In [268]:
# Keeping the last 122 rows (20%, rounded up) as unseen data
unseen_data = df.tail(122)

In [269]:
# Removing those rows from the data used for training/validation
df = df.drop(df.tail(122).index)
df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,Intermediate,Full-time,Data Scientist,79833,DE,0,DE,Large
1,2020,Senior,Full-time,Machine Learning Scientist,260000,JP,0,JP,Small
2,2020,Senior,Full-time,Big Data Engineer,109024,GB,50,GB,Medium
3,2020,Intermediate,Full-time,Product Data Analyst,20000,HN,0,HN,Small
4,2020,Senior,Full-time,Machine Learning Engineer,150000,US,50,US,Large
...,...,...,...,...,...,...,...,...,...
480,2022,Senior,Full-time,Machine Learning Engineer,120000,AE,100,AE,Small
481,2022,Senior,Full-time,Machine Learning Engineer,65000,AE,100,AE,Small
482,2022,Executive - Director,Full-time,Data Engineer,324000,US,100,US,Medium
483,2022,Executive - Director,Full-time,Data Engineer,216000,US,100,US,Medium
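The rows appear to be ordered by `work_year`, so taking the tail leaves a hold-out made up of 2022 records only. As an alternative, here is a minimal sketch of a random hold-out of the same size using `DataFrame.sample`. The frame is a dummy stand-in with 607 rows, and reusing 4588 as the seed is just an assumption for reproducibility.

```python
import math
import pandas as pd

# Stand-in frame with the same number of rows as the dataset; values are dummies.
toy = pd.DataFrame({"salary_in_usd": range(607)})

n_holdout = math.ceil(len(toy) * 0.2)                  # 121.4 rounds up to 122
holdout = toy.sample(n=n_holdout, random_state=4588)   # random rows, fixed seed
train = toy.drop(holdout.index)                        # everything else

print(len(train), len(holdout))
```

This is not what the notebook does; the results below still come from the `tail(122)` split.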


# Using PyCaret for Salary Prediction

In [273]:
from pycaret.regression import *
setup(data = df, # Dataframe for training/validation split
      session_id = 4588,
      target = 'salary_in_usd', # Defining target variable
      normalize = True, # Normalizing data
      remove_outliers = True, # Removing outliers
      fold = 5,
      transform_target = True, # Transforming target variable toward a normal/gaussian distribution
      transformation = True, # Transforming distribution of all other features
      ordinal_features = {'experience_level' : ['Junior','Intermediate','Senior', 'Executive - Director'],
                          'company_size' : ['Small','Medium','Large']}, # Ordinal encoding
      high_cardinality_features = ['employee_residence']) # Encoding of high cardinality features

Unnamed: 0,Description,Value
0,session_id,4588
1,Target,salary_in_usd
2,Original Data,"(485, 9)"
3,Missing Values,False
4,Numeric Features,0
5,Categorical Features,8
6,Ordinal Features,True
7,High Cardinality Features,True
8,High Cardinality Method,frequency
9,Transformed Train Set,"(322, 74)"
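To illustrate what the ordinal and frequency encodings in the setup summary mean, here is a minimal pandas sketch on toy rows. It shows the general idea only; PyCaret's internal implementation may differ, and it also normalizes the features afterwards. The `_enc` column names are made up for this example.

```python
import pandas as pd

# Toy rows, NOT from the dataset
toy = pd.DataFrame({
    "experience_level": ["Junior", "Senior", "Intermediate", "Senior"],
    "employee_residence": ["US", "US", "DE", "US"],
})

# Ordinal encoding: each category becomes its position in the declared order
order = ["Junior", "Intermediate", "Senior", "Executive - Director"]
toy["experience_level_enc"] = toy["experience_level"].map(
    {name: position for position, name in enumerate(order)}
)

# Frequency encoding: each category is replaced by how often it occurs
freq = toy["employee_residence"].value_counts()
toy["employee_residence_enc"] = toy["employee_residence"].map(freq)

print(toy)
```

Frequency encoding keeps a column with many distinct values, such as `employee_residence`, as a single numeric feature instead of one dummy column per country.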



In [274]:
top_3 = compare_models(n_select = 3, sort = 'MAE')

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
br,Bayesian Ridge,32348.2342,2779751275.3179,49854.0483,0.5201,0.4828,0.4418,0.024
ridge,Ridge Regression,32623.8111,2814349680.5608,50296.4672,0.5105,0.4878,0.4428,0.022
gbr,Gradient Boosting Regressor,33516.9608,2942854167.6662,51509.6327,0.4839,0.4808,0.4473,0.05
huber,Huber Regressor,33621.8468,3064824343.563,53050.8204,0.4525,0.5018,0.4349,0.04
rf,Random Forest Regressor,33709.0613,3072872250.5704,52709.7454,0.4525,0.4929,0.4531,0.222
catboost,CatBoost Regressor,34307.4219,2998326252.8221,52331.0782,0.4608,0.4901,0.4582,1.166
omp,Orthogonal Matching Pursuit,34788.6516,3149245781.5153,53459.4816,0.4461,0.5331,0.531,0.022
knn,K Neighbors Regressor,35132.494,3147718941.3179,53126.4677,0.4496,0.5044,0.4555,0.054
xgboost,Extreme Gradient Boosting,35584.5162,3225230118.7444,54772.5146,0.3953,0.5133,0.479,0.214
lightgbm,Light Gradient Boosting Machine,36470.1801,3225665187.3408,54183.4592,0.4264,0.5415,0.5171,0.286


Bayesian Ridge, Ridge Regression and Gradient Boosting Regressor had the lowest (best) MAE scores.
Let's create these models!

In [275]:
br = create_model('br')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,32565.0705,1968612668.8471,44369.0508,0.4358,0.582,0.5811
1,29746.0462,1492780784.9462,38636.5214,0.6301,0.4741,0.4443
2,36849.6188,2639750114.3986,51378.4986,0.5548,0.5307,0.5624
3,39463.638,6710816121.423,81919.5711,0.3631,0.5072,0.3763
4,23116.7977,1086796686.9747,32966.5996,0.6166,0.3201,0.2447
Mean,32348.2342,2779751275.3179,49854.0483,0.5201,0.4828,0.4418
Std,5708.4018,2032397858.1729,17155.907,0.1043,0.0886,0.1242


In [276]:
ridge = create_model('ridge')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,32862.16,1995718996.9107,44673.4708,0.428,0.5847,0.579
1,29798.4697,1526174761.444,39066.2868,0.6218,0.4771,0.4439
2,37167.6973,2696723301.6483,51929.9846,0.5452,0.5294,0.557
3,39586.2281,6699851594.4923,81852.6212,0.3641,0.5166,0.3837
4,23704.5006,1153279748.3086,33959.9727,0.5931,0.3311,0.2502
Mean,32623.8111,2814349680.5608,50296.4672,0.5105,0.4878,0.4428
Std,5598.4929,2009958468.5495,16870.5384,0.0987,0.0856,0.1202


In [277]:
gbr = create_model('gbr')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,35581.4642,2158318118.1318,46457.7025,0.3814,0.6248,0.7177
1,28965.4644,1576578566.4071,39706.1528,0.6093,0.4591,0.431
2,36072.7452,2683507269.6484,51802.5798,0.5474,0.4973,0.4818
3,42100.1298,7010376297.1491,83727.9899,0.3346,0.4774,0.3442
4,24865.0002,1285490586.9943,35853.7388,0.5465,0.3454,0.2619
Mean,33516.9608,2942854167.6662,51509.6327,0.4839,0.4808,0.4473
Std,6000.6256,2090026648.5142,17017.9876,0.1063,0.0892,0.1546


We can now tune these models to try to obtain an even lower MAE.

In [278]:
tuned_br = tune_model(br, n_iter=1000, optimize='MAE')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,32056.715,1944605371.2933,44097.6799,0.4427,0.5814,0.5708
1,30105.8921,1511059085.5815,38872.3435,0.6256,0.4829,0.4576
2,36908.8465,2643853262.4859,51418.4137,0.5541,0.5351,0.5729
3,38898.9867,6623540974.7437,81385.1398,0.3713,0.4927,0.3686
4,23383.2279,1097857077.5814,33133.9264,0.6127,0.3169,0.2496
Mean,32270.7336,2764183154.3371,49781.5006,0.5213,0.4818,0.4439
Std,5461.9764,1996386352.2393,16911.1014,0.099,0.0895,0.1236


In [279]:
tuned_ridge = tune_model(ridge, n_iter=1000, optimize='MAE')

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,31587.0178,1909157943.423,43693.912,0.4528,0.5832,0.5894
1,30194.7288,1492823619.2874,38637.0757,0.6301,0.4789,0.459
2,35994.823,2562030659.3737,50616.5058,0.5679,0.5553,0.6175
3,38959.3754,6661362304.2719,81617.1692,0.3677,0.4907,0.3664
4,22629.8814,977053725.0808,31257.8586,0.6553,0.3001,0.2492
Mean,31873.1653,2720485650.2874,49164.5043,0.5348,0.4816,0.4563
Std,5577.4504,2037674084.5474,17416.5775,0.1089,0.0988,0.1376


In [282]:
# tuned_gbr = tune_model(gbr, n_iter=1000, optimize = 'MAE')

We can blend different models to try to achieve a lower MAE.

In [283]:
blended_model = blend_models(estimator_list = [tuned_br, tuned_ridge, gbr], fold = 10, optimize = 'MAE', choose_better = True)

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,26137.3168,1046946680.4249,32356.5554,0.6879,0.5384,0.5552
1,34020.1207,2413491385.5993,49127.2978,0.3026,0.5843,0.5814
2,29287.1149,1436240274.3631,37897.7608,0.6471,0.3392,0.2979
3,28524.1692,1493098884.7556,38640.6377,0.5962,0.5629,0.5727
4,35850.6407,3125872992.3255,55909.5072,0.5427,0.5989,0.6789
5,35171.7037,1851365259.7014,43027.4942,0.6279,0.371,0.3485
6,33160.083,4533373003.9002,67330.3275,0.4371,0.5054,0.3939
7,44062.2586,8589197599.5486,92677.924,0.3286,0.4161,0.2653
8,24125.7786,1174599308.5976,34272.4278,0.3724,0.2953,0.2315
9,21856.6622,974751156.2399,31221.005,0.7428,0.3316,0.2571


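The blend reached a mean MAE of 31219.5848 across its 10 folds, lower (and therefore better) than the tuned_ridge model's 31873.1653, although that figure comes from 5-fold cross-validation, so the comparison is not exact.<br>
Let's try to tune this blended_model.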
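The 31219.5848 figure is the mean of the ten fold-level MAE values in the `blend_models` table, which can be checked with the standard library alone:

```python
from statistics import mean

# MAE per fold, copied from the blend_models output above
fold_mae = [26137.3168, 34020.1207, 29287.1149, 28524.1692, 35850.6407,
            35171.7037, 33160.0830, 44062.2586, 24125.7786, 21856.6622]

blend_mae = round(mean(fold_mae), 4)
tuned_ridge_mae = 31873.1653  # mean MAE from the tuned_ridge table

print(blend_mae)
print(blend_mae < tuned_ridge_mae)  # lower MAE means a smaller average error
```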

In [284]:
tune_blended_model = tune_model(blended_model, n_iter = 1000, optimize = 'MAE', choose_better = True)

Unnamed: 0_level_0,MAE,MSE,RMSE,R2,RMSLE,MAPE
Fold,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,31587.0178,1909157943.423,43693.912,0.4528,0.5832,0.5894
1,30194.7288,1492823619.2874,38637.0757,0.6301,0.4789,0.459
2,35994.823,2562030659.3737,50616.5058,0.5679,0.5553,0.6175
3,38959.3754,6661362304.2719,81617.1692,0.3677,0.4907,0.3664
4,22629.8814,977053725.0808,31257.8586,0.6553,0.3001,0.2492
Mean,31873.1653,2720485650.2874,49164.5043,0.5348,0.4816,0.4563
Std,5577.4504,2037674084.5474,17416.5775,0.1089,0.0988,0.1376


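No progress made: the tuned result is identical, fold by fold, to the tuned_ridge table above (mean MAE 31873.1653). Together with the Ridge Regression label in the outputs below, this suggests that `choose_better = True` may have returned the tuned Ridge model rather than an actual ensemble.<br><br>
### *Best Model*: Blended_Model <br>
### *MAE Score*: 31219.5848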

In [286]:
evaluate_model(blended_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Hyperparameters', 'param…

The 5 most important features for predicting salary in USD were:<br><br>
- Employee Residence <br><br>
- Company Location in India <br><br>
- Job Title: Data Analyst <br><br>
- Company Location in Japan <br><br>
- Job Title: Principal Data Engineer

In [287]:
# ytest data
ytest = get_config('y_test')
ytest

278     20171
274     77684
219    140000
352    167000
235    110000
        ...  
177     40038
105     51519
158    120000
279     59102
79      80000
Name: salary_in_usd, Length: 146, dtype: int64

In [288]:
# Predicting on the hold-out sample
predict_model(blended_model)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Ridge Regression,30038.8795,2074915234.5033,45551.2375,0.5294,0.489,0.4749


Unnamed: 0,experience_level,employee_residence,company_size,work_year_2020,work_year_2021,work_year_2022,employment_type_Freelance,employment_type_Full-time,employment_type_Part-time,job_title_AI Scientist,...,company_location_PK,company_location_PL,company_location_RU,company_location_SI,company_location_TR,company_location_UA,company_location_US,company_location_VN,salary_in_usd,Label
0,2.0,-1.164929,2.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,20171,64126.925428
1,2.0,-1.080920,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,77684,76533.478669
2,2.0,0.960338,2.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,140000,170105.146098
3,2.0,0.960338,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,167000,157108.507666
4,1.0,0.960338,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,110000,119332.287918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
141,1.0,-1.164929,2.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,40038,42059.713033
142,1.0,-0.910395,2.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,51519,52413.014940
143,2.0,0.960338,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,120000,163178.746388
144,0.0,-1.164929,2.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,59102,36912.139469


### *MAE Score on hold-out sample*: 30038.8795

In [289]:
# Finalizing model
final_blended_model = finalize_model(blended_model)
print(final_blended_model)

PowerTransformedTargetRegressor(alpha=4.1, copy_X=True, fit_intercept=False,
                                max_iter=None, normalize=False,
                                power_transformer_method='box-cox',
                                power_transformer_standardize=True,
                                random_state=4588,
                                regressor=Ridge(alpha=4.1, copy_X=True,
                                                fit_intercept=False,
                                                max_iter=None, normalize=False,
                                                random_state=4588,
                                                solver='auto', tol=0.001),
                                solver='auto', tol=0.001)


Now, let's go and try to predict on unseen data

In [293]:
unseen_data.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
485,2022,Senior,Full-time,Machine Learning Engineer,120000,US,100,US,Medium
486,2022,Senior,Full-time,Data Scientist,230000,US,100,US,Medium
487,2022,Junior,Part-time,Data Scientist,100000,DZ,50,DZ,Medium
488,2022,Intermediate,Freelance,Data Scientist,100000,CA,100,US,Medium
489,2022,Junior,Contract,Applied Machine Learning Scientist,31875,TN,100,CZ,Medium


In [294]:
unseen_predictions = predict_model(final_blended_model, data = unseen_data)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Ridge Regression,34387.6769,2535471652.3698,50353.4671,0.3864,0.4815,0.3435


In [296]:
unseen_predictions[['salary_in_usd','Label']].head(10).round(0)

Unnamed: 0,salary_in_usd,Label
485,120000,175020.0
486,230000,153974.0
487,100000,16540.0
488,100000,40043.0
489,31875,58323.0
490,200000,91558.0
491,75000,77383.0
492,35590,46667.0
493,78791,107482.0
494,100000,64569.0
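To show how the error metrics are built from these pairs, here is a small sketch that computes MAE and RMSE for the ten rows above. It covers only those ten rows, so the result will not match the score reported for all 122 unseen rows.

```python
# (actual, predicted) pairs copied from the ten rows shown above
rows = [(120000, 175020), (230000, 153974), (100000, 16540), (100000, 40043),
        (31875, 58323), (200000, 91558), (75000, 77383), (35590, 46667),
        (78791, 107482), (100000, 64569)]

# Absolute error of each prediction
abs_errors = [abs(actual - predicted) for actual, predicted in rows]

mae = sum(abs_errors) / len(rows)                             # mean absolute error
rmse = (sum(e ** 2 for e in abs_errors) / len(rows)) ** 0.5   # root mean squared error

print(mae, round(rmse, 2))
```

RMSE is never smaller than MAE, and the gap grows when a few predictions are far off, as with rows 487 and 490 here.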


### *MAE Score on Unseen Data*: 34387.6769 

Thank you so much for reading. Please, feel free to leave comments and suggestions, and if you liked this notebook, upvote it!<br><br>

*Luís Fernando Torres*