# Baseline Model

In this notebook we build a baseline model that always predicts the mean salary.

In [24]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Get Data

In [3]:
data = pd.read_csv('derived_data/train_data_merged.csv')

print(data.shape)

(1000000, 9)


In [4]:
data.sample(5)

Unnamed: 0,jobId,salary,companyId,jobType,degree,major,industry,yearsExperience,milesFromMetropolis
643232,JOB1362685050919,93,COMP36,JUNIOR,HIGH_SCHOOL,NONE,WEB,1,12
683912,JOB1362685091599,118,COMP40,VICE_PRESIDENT,NONE,NONE,OIL,4,90
886427,JOB1362685294114,135,COMP2,VICE_PRESIDENT,BACHELORS,PHYSICS,FINANCE,4,46
665782,JOB1362685073469,96,COMP27,SENIOR,NONE,NONE,WEB,14,83
429793,JOB1362684837480,148,COMP39,CFO,BACHELORS,BIOLOGY,WEB,4,15


# Get Features, Target

In [10]:
features = data.drop('salary', axis=1)
target = data['salary']

print(features.shape, target.shape)

(1000000, 8) (1000000,)


# Split Data

In [18]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=777
)

print(X_train.shape, X_test.shape)

(800000, 8) (200000, 8)


# Get Mean Salary

In [12]:
mean = target.mean()

print(mean)

116.061818


# Run Dummy Regressor

In [14]:
dummy_regr = DummyRegressor(strategy="mean")

In [19]:
# Fit the dummy regressor on the training data
dummy_regr.fit(X_train, y_train)

DummyRegressor()

In [26]:
# Predict on the test set
y_pred = dummy_regr.predict(X_test)
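
As a rough illustration of what `DummyRegressor(strategy="mean")` does, the sketch below reproduces the idea in plain Python: fitting stores the mean of the training targets, and predicting returns that constant for every row, ignoring the features. The salary values and the helper names `fit_mean` and `predict_mean` are made up for illustration and are not part of scikit-learn.

```python
from statistics import fmean

def fit_mean(y_train):
    """Return the constant a mean-strategy baseline would learn."""
    return fmean(y_train)

def predict_mean(constant, n_rows):
    """Predict the same constant for every row, regardless of features."""
    return [constant] * n_rows

# Toy salary values (hypothetical, not taken from the dataset)
toy_y_train = [90, 110, 130, 150]
constant = fit_mean(toy_y_train)
toy_y_pred = predict_mean(constant, n_rows=3)
print(constant, toy_y_pred)
```

Note that the baseline learns its mean from `y_train` only, so the value can differ slightly from the mean of the full `target` computed earlier.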

# Score Dummy Regressor

In [22]:
dummy_regr.score(X_test, y_test)

-2.2124643013210488e-06
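
A score this close to zero is what one would expect. `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of the test targets. A constant prediction equal to the training mean is almost, but not exactly, the test mean, so R^2 comes out slightly below zero. The sketch below works through this with made-up numbers; `r_squared` is a hypothetical helper, not a scikit-learn function.

```python
from statistics import fmean

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical): the training mean is close to,
# but not equal to, the test mean
toy_y_train = [100, 120, 140]   # mean = 120
toy_y_test = [99, 121, 143]     # mean = 121
train_mean = fmean(toy_y_train)

score = r_squared(toy_y_test, [train_mean] * len(toy_y_test))
print(score)
```

Predicting the test mean itself would give exactly zero; the small negative value comes only from the gap between the training and test means.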

In [27]:
# Get the MSE of the dummy regressor
mean_squared_error(y_test, y_pred)

1501.7286608666768

Our results show that a model that always predicts the average salary has a mean squared error (MSE) of about 1501.73 on the test set. Note that MSE is expressed in squared salary units, so it is not a dollar amount.

Our objective is therefore to develop a model that predicts salary with an MSE significantly lower than 1501.73.
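
Because MSE is expressed in squared salary units, it can help to also look at its square root, the root mean squared error (RMSE), which is in the same units as `salary`. A minimal sketch, using the MSE value reported above; the variable names are illustrative.

```python
import math

# MSE reported above for the dummy regressor
baseline_mse = 1501.7286608666768

# RMSE is the square root of MSE, in the same units as the target
baseline_rmse = math.sqrt(baseline_mse)

print(round(baseline_rmse, 2))
```

This puts the baseline's typical error at roughly 38.75 salary units, which may be an easier number to compare later models against.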

# References

scikit-learn. sklearn.dummy.DummyRegressor. Retrieved September 3, 2020, from https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html