# Multiple Linear Regression

In this experiment, we apply *multiple linear regression* to predict the normalized salary in USD from the following set of ***normalized*** predictors:

- AI or ML Job
- Experience Level
- Work Year
- Company Size
- Same Country
- Remote Ratio
- GDP at Employee Residence

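This experiment shows that GDP at employee residence (coefficient ≈ 0.55) and experience level (≈ 0.47) have by far the largest coefficients; every other predictor has a coefficient below 0.1 in absolute value. The model reaches an R2 score of about 0.52 on the data it was fitted on, so it explains roughly half of the variance in salary.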
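The normalization used here is a z-score: each column has its mean subtracted and is divided by its standard deviation. The following is a minimal sketch of that step on synthetic data, not on the salary dataset; the column names `x1`, `x2`, `y` and all generated values are assumptions chosen only for illustration. It shows two consequences that matter for reading the results below: the fitted intercept is numerically zero, and the coefficients become comparable across predictors measured on very different scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical columns, not the salary dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(50.0, 10.0, size=200),
    "x2": rng.normal(0.0, 1000.0, size=200),
})
df["y"] = 3.0 * df["x1"] + 0.01 * df["x2"] + rng.normal(0.0, 5.0, size=200)

# z-score normalization: subtract the column mean, divide by the column std.
# pandas uses the sample standard deviation (ddof=1) by default.
df_norm = (df - df.mean()) / df.std()

X_demo = df_norm[["x1", "x2"]].to_numpy()
y_demo = df_norm["y"].to_numpy()
reg_demo = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

print("Column means after normalization:", df_norm.mean().to_numpy())
print("Column stds after normalization:", df_norm.std().to_numpy())
print("Intercept:", reg_demo.intercept_)
print("Standardized coefficients:", reg_demo.coef_)
```

Because both the predictors and the target are centered, the intercept only reflects floating-point rounding, and each coefficient can be read as the change in the target, in standard deviations, per one standard deviation change in that predictor.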

In [1]:
from src.preprocessing import complete_preprocessing
from sklearn.linear_model import LinearRegression

In [2]:
# load the preprocessed data and keep only the target and the predictors
salaries = complete_preprocessing()
salaries = salaries[["ai_or_ml_job", "salary_in_usd", "experience_level", "work_year", "company_size", "same_country", "remote_ratio", "gdp_employee_residence"]]

In [3]:
# normalize every column to zero mean and unit standard deviation (z-score)
salaries_norm = (salaries - salaries.mean()) / salaries.std()
X = salaries_norm[["ai_or_ml_job", "experience_level", "work_year", "company_size", "same_country", "remote_ratio", "gdp_employee_residence"]]
y = salaries_norm["salary_in_usd"]
X = X.to_numpy()
y = y.to_numpy()

# calculate regression coefficients
# X already has shape (n_samples, 7), so no reshape is needed
reg = LinearRegression(fit_intercept=True).fit(X, y)
# R2 is computed on the same data the model was fitted on
r2 = reg.score(X, y)
print('Coefficients:', reg.coef_)
print('R2 Score:', r2)
print('Intercept:', reg.intercept_)


Coefficients: [ 0.052594    0.47384762  0.04081579  0.04483822 -0.09111529  0.05989227
  0.55202593]
R2 Score: 0.5233901918678077
Intercept: -8.675898081979057e-15
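The printed coefficient array is easier to read when each value is paired with its predictor. The sketch below does that and sorts by absolute value. The coefficient values are copied from the output above, and the names follow the column order used to build `X`; the variable names `feature_names`, `coefficients`, and `ranked` are introduced here only for illustration.

```python
import numpy as np

# Predictor names in the order used to build X in the cell above.
feature_names = [
    "ai_or_ml_job", "experience_level", "work_year", "company_size",
    "same_country", "remote_ratio", "gdp_employee_residence",
]

# Coefficient values copied from the printed output above.
coefficients = np.array([
    0.052594, 0.47384762, 0.04081579, 0.04483822,
    -0.09111529, 0.05989227, 0.55202593,
])

# Sort predictors by coefficient magnitude, largest first.
order = np.argsort(-np.abs(coefficients))
ranked = [(feature_names[i], float(coefficients[i])) for i in order]

for name, coef in ranked:
    print(f"{name:<24} {coef:+.4f}")
```

Since all variables are z-scored, the magnitudes are directly comparable. The intercept of about -8.7e-15 is zero up to floating-point rounding, which is expected when both predictors and target are centered.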
