# Salary Prediction Using Machine Learning

## Objective

The objective of this notebook is to build and evaluate machine learning models to predict salary in USD based on job-related features. Linear Regression and Random Forest models are trained and compared.

## Data Loading & Exploratory Data Analysis (EDA)

In [None]:
import pandas as pd

df = pd.read_csv("/ds_salaries.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


Select Features (X) and Target (y)

In [None]:
X = df[
    [
        "experience_level",
        "employment_type",
        "job_title",
        "company_location",
        "company_size",
        "remote_ratio",
        "work_year"
    ]
]

y = df["salary_in_usd"]

Encode Categorical Variables

In [None]:
X_encoded = pd.get_dummies(X, drop_first=True)
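
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The values are made up for illustration and are not taken from `ds_salaries.csv`. A categorical column with k levels becomes k - 1 indicator columns, the dropped first level acts as the baseline, and numeric columns pass through unchanged.

```python
import pandas as pd

# Toy frame; values are illustrative only, not taken from ds_salaries.csv.
toy = pd.DataFrame(
    {
        "company_size": ["L", "S", "M", "S"],
        "remote_ratio": [0, 0, 50, 100],
    }
)

# company_size has three levels (L, M, S). drop_first=True drops the first
# level in sorted order ("L"), leaving two indicator columns.
toy_encoded = pd.get_dummies(toy, drop_first=True)

encoded_columns = list(toy_encoded.columns)
print(encoded_columns)
```

A row with zeros in both `company_size_M` and `company_size_S` therefore represents the baseline level `L`.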

Train–Test Split

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded,
    y,
    test_size=0.2,
    random_state=42
)

In [None]:
X_train.shape, X_test.shape

((485, 108), (122, 108))
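
The row counts above follow from the split arithmetic. As a quick check, assuming the usual scikit-learn behaviour for a fractional `test_size` (the test set is rounded up and the remainder goes to training), the numbers can be reproduced by hand:

```python
import math

n_samples = 485 + 122  # total rows, taken from the shapes reported above
test_size = 0.2

# The test set is rounded up; the training set gets the remaining rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```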

## Model Comparison
Both Linear Regression and Random Forest models are evaluated using RMSE and R² score to determine the best-performing model.
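
For reference, both metrics can be computed directly from their definitions. The sketch below is a plain-Python illustration on made-up numbers. The helper names `rmse_manual` and `r2_manual` are hypothetical; the notebook itself relies on scikit-learn's `mean_squared_error` and `r2_score`.

```python
import math


def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Illustrative values only (not from the salary dataset).
y_true_toy = [100.0, 200.0, 300.0]
y_pred_toy = [110.0, 190.0, 300.0]

toy_rmse = rmse_manual(y_true_toy, y_pred_toy)
toy_r2 = r2_manual(y_true_toy, y_pred_toy)
print(toy_rmse, toy_r2)
```

RMSE is expressed in the units of the target (here USD), while R² is unitless and compares the model against always predicting the mean of the true values.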


Train First ML Model – Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)


Make Predictions

In [None]:
y_pred = model.predict(X_test)
y_pred[:5]

array([145384.99341377, 145384.99341377,  76951.73584284, 191981.30905236,
       133114.82880717])

Evaluate Model Performance

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

rmse, r2

(np.float64(46803.16257427774), 0.428441881664966)

**Random Forest Regression**

Train a Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train, y_train)

Make Predictions

In [None]:
y_pred_rf = rf_model.predict(X_test)
y_pred_rf[:5]

array([141523.82597463, 141523.82597463, 112083.12019841, 128831.43619048,
        94968.58450397])

Evaluate Random Forest Performance

In [None]:
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

rmse_rf, r2_rf

(np.float64(48939.55107577376), 0.3750720341501509)

**Feature Importance: What Affects Salary Most**

In [None]:
import pandas as pd

feature_importance = pd.Series(
    rf_model.feature_importances_,
    index=X_encoded.columns
).sort_values(ascending=False)

feature_importance.head(10)

feature,importance
company_location_US,0.295973
experience_level_EX,0.082788
job_title_Principal Data Engineer,0.046506
experience_level_SE,0.044162
remote_ratio,0.041175
job_title_Data Analyst,0.040045
work_year,0.036793
company_size_S,0.035846
job_title_Financial Data Analyst,0.030652
job_title_Machine Learning Scientist,0.030568
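
Because one-hot encoding splits each categorical feature across many columns, a per-column ranking can understate a feature such as `job_title`. One way to get a per-feature view is to sum the importances of all columns derived from the same source column. The sketch below uses a hypothetical helper, `group_importances`, and toy importance values. In the notebook, `feature_importance.to_dict()` and `list(X.columns)` could be passed instead.

```python
def group_importances(importances, source_columns):
    """Sum per-column importances back onto the original feature names.

    importances: mapping of encoded column name -> importance
    source_columns: names of the columns before one-hot encoding
    """
    # Check longer names first so that a column is never assigned to a
    # shorter feature name that merely happens to be its prefix.
    ordered = sorted(source_columns, key=len, reverse=True)
    totals = {name: 0.0 for name in source_columns}
    for column, value in importances.items():
        for name in ordered:
            # A numeric column keeps its name; a dummy column is "<name>_<level>".
            if column == name or column.startswith(name + "_"):
                totals[name] += value
                break
    # Highest total first.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Toy values for illustration only; they are not the notebook's results.
toy_importances = {
    "company_location_US": 0.30,
    "job_title_Data Analyst": 0.25,
    "job_title_Data Engineer": 0.25,
    "remote_ratio": 0.20,
}
grouped = group_importances(
    toy_importances, ["company_location", "job_title", "remote_ratio"]
)
print(grouped)
```

In this toy case `job_title` ranks first once its two columns are combined, even though neither column leads on its own.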


RMSE: lower is better.

R² score: higher is better.

## Final Model Selection
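Linear Regression was selected as the final model because, on the held-out test set, it achieved a lower RMSE (about 46,803 vs. 48,940 USD) and a higher R² (about 0.43 vs. 0.38) than Random Forest. Both comparisons rest on a single 80/20 split, so the gap is indicative of better generalization to unseen data rather than conclusive.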
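
The selection rule can be written out as a short sketch. The metric values below are copied (rounded) from the evaluation outputs earlier in the notebook. The `results` dictionary and the variable names are illustrative, not part of the original code; in the notebook, `rmse`, `r2`, `rmse_rf`, and `r2_rf` could be used in their place.

```python
# Test-set metrics copied from the evaluation outputs above (rounded).
results = {
    "Linear Regression": {"rmse": 46803.16, "r2": 0.4284},
    "Random Forest": {"rmse": 48939.55, "r2": 0.3751},
}

# Lower RMSE is better; higher R² is better.
best_by_rmse = min(results, key=lambda name: results[name]["rmse"])
best_by_r2 = max(results, key=lambda name: results[name]["r2"])

for name, m in results.items():
    print(f"{name:<18} RMSE = {m['rmse']:>10,.2f}   R² = {m['r2']:.4f}")
print("Lowest RMSE:", best_by_rmse)
print("Highest R²:", best_by_r2)
```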
