# Salary Prediction

This notebook predicts salaries from features such as age, years of experience, and department. We train two base models, XGBoost and LightGBM, and combine them in a stacking ensemble, where a linear regression meta-model learns how to weight their predictions.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import numpy as np



## Data Loading

First, let's load the training and test datasets.

In [2]:
# Load the datasets
train_data = pd.read_csv('/kaggle/input/thapar-summer-school-employee-salary-prediction/train.csv')
test_data = pd.read_csv('/kaggle/input/thapar-summer-school-employee-salary-prediction/test.csv')


## Data Preprocessing

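We hold out 20% of the training data as a validation set, then separate the target variable (`salary`) from the features in each split. The categorical columns `company` and `department` are one-hot encoded with an encoder fitted on the training split only, so the validation and test sets are transformed with the same category mapping.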


In [3]:
# Split the training data into training and validation sets
train_data, validation_data = train_test_split(train_data, test_size=0.2, random_state=42)

# Separate the target variable (salary) from the features
train_features = train_data.drop('salary', axis=1)
train_target = train_data['salary']
validation_features = validation_data.drop('salary', axis=1)
validation_target = validation_data['salary']

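# One-hot encode the categorical variables
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use sparse=False on older versions
encoder = OneHotEncoder(drop='first', sparse_output=False)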
encoder.fit(train_features[['company', 'department']])

train_encoded_features = encoder.transform(train_features[['company', 'department']])
validation_encoded_features = encoder.transform(validation_features[['company', 'department']])
test_encoded_features = encoder.transform(test_data[['company', 'department']])

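# Get the feature names, prefixed with the source column (e.g. company_<name>)
# so that a label shared by 'company' and 'department' cannot produce duplicate columns
feature_names = encoder.get_feature_names_out().tolist()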

# Convert the encoded features to DataFrames
train_encoded_df = pd.DataFrame(train_encoded_features, columns=feature_names)
validation_encoded_df = pd.DataFrame(validation_encoded_features, columns=feature_names)
test_encoded_df = pd.DataFrame(test_encoded_features, columns=feature_names)

# Reset the indices
train_features.reset_index(drop=True, inplace=True)
validation_features.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

# Add the encoded features to the original DataFrames
train_features = pd.concat([train_features.drop(['company', 'department'], axis=1), train_encoded_df], axis=1)
validation_features = pd.concat([validation_features.drop(['company', 'department'], axis=1), validation_encoded_df], axis=1)
test_features = pd.concat([test_data.drop(['company', 'department'], axis=1), test_encoded_df], axis=1)



In [4]:
# Initialize the base models
base_models = [
    ('xgb', XGBRegressor(random_state=42)),
    ('lgbm', LGBMRegressor(random_state=42))
]

In [5]:
# Initialize the meta-model
meta_model = LinearRegression()


In [6]:
# Initialize the stacking regressor
stacking_reg = StackingRegressor(estimators=base_models, final_estimator=meta_model, cv=5)


In [7]:
# Train the stacking regressor
stacking_reg.fit(train_features, train_target)
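
As a rough illustration of what the `fit` call above does internally, the sketch below reproduces the stacking steps by hand: out-of-fold predictions from each base model become the training features of the meta-model, and the base models are then refit on all of the data. It is a simplified sketch, not the notebook's pipeline. The data is synthetic, and `DecisionTreeRegressor` and `Ridge` are stand-ins (an assumption made so the example runs without XGBoost or LightGBM installed).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Stand-in base models (the notebook uses XGBoost and LightGBM instead).
demo_base_models = [DecisionTreeRegressor(max_depth=3, random_state=42), Ridge()]

# Step 1: out-of-fold predictions, one column per base model.
oof_predictions = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in demo_base_models
])

# Step 2: the meta-model learns how to weight the base predictions.
demo_meta_model = LinearRegression().fit(oof_predictions, y)

# Step 3: refit each base model on all of the data for prediction time.
for model in demo_base_models:
    model.fit(X, y)

def stacked_predict(X_new):
    """Feed the base models' predictions through the meta-model."""
    base_preds = np.column_stack([m.predict(X_new) for m in demo_base_models])
    return demo_meta_model.predict(base_preds)

print(demo_meta_model.coef_, demo_meta_model.intercept_)
```

Training the meta-model on out-of-fold predictions, rather than on predictions for rows the base models have already seen, keeps it from simply rewarding whichever base model overfits the most.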

In [8]:
# Make predictions on the validation set
validation_pred_stacking = stacking_reg.predict(validation_features)

In [9]:
# Calculate the MAE of the ensemble predictions
mae_stacking = mean_absolute_error(validation_target, validation_pred_stacking)

print(f'MAE of ensemble: {mae_stacking}')

MAE of ensemble: 11732.863309331533
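
The MAE is the average absolute gap between predicted and actual salary, in the same units as the target. A number like this is easier to judge next to a naive baseline, so the sketch below computes the MAE by hand and compares it with a predictor that outputs the median for every row. The salaries are made up for illustration, and `mean_absolute_error_manual` is a hypothetical helper, not part of the notebook.

```python
from statistics import median

def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up salaries, for illustration only.
actual = [50000, 62000, 58000, 91000]
predicted = [52000, 60000, 61000, 85000]

model_mae = mean_absolute_error_manual(actual, predicted)

# Naive baseline: predict the median salary for everyone.
baseline = median(actual)
baseline_mae = mean_absolute_error_manual(actual, [baseline] * len(actual))

print(f'model MAE: {model_mae}, baseline MAE: {baseline_mae}')
```

In the notebook, the equivalent check would take the median of `train_target` and score it against `validation_target` with `mean_absolute_error`.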


In [10]:
# Make predictions on the test set
test_pred_stacking = stacking_reg.predict(test_features)

In [11]:
# Prepare the submission file
submission = pd.DataFrame({'id': test_data['id'], 'salary': test_pred_stacking})


In [12]:
# Save the submission file
submission.to_csv('/kaggle/working/submission.csv', index=False)
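
Before uploading, it can be worth checking that the submission frame has the expected shape. The sketch below uses a hypothetical helper, `check_submission`, on made-up frames. It assumes the `id` and `salary` columns written above are what the competition expects, which the notebook does not confirm.

```python
import pandas as pd

def check_submission(submission_df, expected_ids):
    """Hypothetical helper: raise ValueError if the frame looks malformed."""
    if list(submission_df.columns) != ['id', 'salary']:
        raise ValueError(f'unexpected columns: {list(submission_df.columns)}')
    if submission_df['id'].tolist() != list(expected_ids):
        raise ValueError('ids do not match the test set')
    if submission_df['salary'].isna().any():
        raise ValueError('missing salary predictions')
    return True

# Made-up frames for illustration.
good = pd.DataFrame({'id': [1, 2, 3], 'salary': [50000.0, 61000.0, 58500.0]})
bad = pd.DataFrame({'id': [1, 2, 3], 'salary': [50000.0, None, 58500.0]})

print(check_submission(good, [1, 2, 3]))
try:
    check_submission(bad, [1, 2, 3])
except ValueError as err:
    print(f'rejected: {err}')
```

With the notebook's variables, the call would be `check_submission(submission, test_data['id'])`.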