Dataset: Employee data used in lecture, linked here.
1. Preprocess the test data and get predictions.
2. Compute Mean Absolute Error and Mean Squared Error for the test data.
3. Optional: Read about Ridge and Lasso Regression, implement them, and compute evaluation metrics. Do they perform better than Linear Regression?

In [2]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/employee.csv')

# Display the first few rows of the dataset
print(data.head())

# Display a summary of the dataset
data.info()  # info() prints its summary directly and returns None, so wrapping it in print() would also print "None"

# Check for missing values
print(data.isnull().sum())


   id            timestamp        country employment_status  job_title  \
0   1  12/11/2018 10:52:26       Slovenia         Full time  Developer   
1   2    1/5/2017 16:57:50  United States         Full time        DBA   
2   3   12/18/2017 8:13:15         Sweden         Full time        DBA   
3   4   12/27/2018 4:56:52  United States         Full time        DBA   
4   5  12/11/2018 14:07:58  United States         Full time  Developer   

   job_years is_manager  hours_per_week  telecommute_days_per_week  \
0    4.78393        Yes            40.0                        0.0   
1    5.00000         No            40.0                        5.0   
2    1.00000         No            40.0                        0.0   
3    1.00000         No            40.0                        2.0   
4    3.00000         No            40.0                        2.0   

             education is_education_computer_related certifications  \
0  Bachelors (4 years)                           Yes           

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Drop 'id' and 'timestamp' as they are not useful for prediction
data = data.drop(['id', 'timestamp'], axis=1)

# Simple Imputer for Numerical Columns
num_imputer = SimpleImputer(strategy='mean')

# Categorical Columns to be One-Hot Encoded
cat_cols = ['country', 'employment_status', 'job_title', 'is_manager', 'education', 'is_education_computer_related', 'certifications']
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_imputer, ['job_years', 'hours_per_week', 'telecommute_days_per_week']),
        ('cat', cat_transformer, cat_cols)
    ])

# Splitting the data
X = data.drop('salary', axis=1)
y = data['salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply preprocessing
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)
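
The key point of the preprocessing step is that the imputers and the encoder are fitted on the training split only and then reused on the test split. A minimal sketch of what that means in practice, using a hypothetical two-column toy frame (the values are made up and are not taken from employee.csv): the test row's missing number is filled with the training mean, and a job title never seen during fitting is encoded as all zeros because of `handle_unknown='ignore'`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, only to illustrate the fit/transform split
train = pd.DataFrame({
    "job_years": [1.0, np.nan, 3.0],
    "job_title": pd.Series(["DBA", "Developer", np.nan], dtype=object),
})
test = pd.DataFrame({
    "job_years": [np.nan],                               # filled with the TRAIN mean
    "job_title": pd.Series(["Analyst"], dtype=object),   # never seen during fit
})

cat_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="mean"), ["job_years"]),
        ("cat", cat_pipe, ["job_title"]),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to inspect
)

train_out = pre.fit_transform(train)  # learns the mean and the category list
test_out = pre.transform(test)        # reuses them, learns nothing new

categories = list(pre.named_transformers_["cat"].named_steps["onehot"].categories_[0])
print(categories)
print(test_out)
```

The first column of `test_out` is the training mean of `job_years`, and the one-hot columns are all zero for the unseen title. Calling `fit_transform` on the test split instead would leak test statistics into the preprocessing and could change the column layout.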


In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Linear Regression
lr = LinearRegression()
lr.fit(X_train_preprocessed, y_train)

# Predicting and Evaluating
y_pred = lr.predict(X_test_preprocessed)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

mae, mse


(869.1470474400523, 1344813.8488348853)
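
MSE is expressed in squared salary units, which makes it hard to compare with MAE. A small sketch that takes the square root of the MSE printed above to get the root mean squared error (RMSE), which is in the same units as the target; the two numbers are copied from the cell output rather than recomputed.

```python
import math

# Values copied from the Linear Regression output above
mae = 869.1470474400523
mse = 1344813.8488348853

rmse = math.sqrt(mse)  # same units as salary, unlike MSE

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"RMSE / MAE: {rmse / mae:.2f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions miss by a large amount, so a ratio well above 1 hints that some test rows have much larger errors than the typical one.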

In [5]:
from sklearn.linear_model import Ridge, Lasso

# Ridge Regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_preprocessed, y_train)
y_pred_ridge = ridge.predict(X_test_preprocessed)
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

# Lasso Regression
lasso = Lasso(alpha=1.0)
lasso.fit(X_train_preprocessed, y_train)
y_pred_lasso = lasso.predict(X_test_preprocessed)
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)

(mae_ridge, mse_ridge, mae_lasso, mse_lasso)


(866.5236230326344, 1340609.9402750763, 862.9723307338778, 1341679.2063698836)
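
To see what the `alpha` penalty does to the fitted coefficients, here is a small sketch on synthetic data from `make_regression` (not the employee dataset; the sample sizes and alpha values are arbitrary choices for illustration). Ridge shrinks the coefficient vector as a whole, while Lasso can set individual coefficients to exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic problem: 20 features, only 5 of which actually drive the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

# L2 norm of the coefficient vector: Ridge's penalty pulls this down
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

# Number of coefficients Lasso has driven to exactly zero
lasso_zeros = int(np.sum(lasso.coef_ == 0))

print(f"OLS   coefficient norm: {ols_norm:.2f}")
print(f"Ridge coefficient norm: {ridge_norm:.2f}")
print(f"Lasso zeroed coefficients: {lasso_zeros} of {lasso.coef_.size}")
```

The same inspection could be run on the `lr`, `ridge`, and `lasso` objects fitted above by reading their `coef_` attributes.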

Both Ridge and Lasso Regression improve slightly on plain Linear Regression in MAE and MSE, but every gain is under 1%:

Lasso Regression has the lowest MAE (862.97 versus 869.15 for Linear Regression, about 0.7% lower), so its average absolute error is marginally smaller.
Ridge Regression has the lowest MSE (1,340,609.94 versus 1,344,813.85, about 0.3% lower), which is consistent with its L2 penalty shrinking large coefficients and so slightly damping the largest errors.
For this dataset, regularization therefore helps a little rather than a lot. The one-hot encoded categorical columns produce many sparse, collinear features, which is where Ridge and Lasso can reduce overfitting compared to plain Linear Regression. However, these numbers come from a single train/test split with alpha=1.0 left untuned, so differences this small may not hold on another split.
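
The scores above come from one 80/20 split with `alpha=1.0` for both regularized models. A hedged sketch of a sturdier comparison follows: it uses synthetic `make_regression` data as a stand-in for the preprocessed employee features, lets `RidgeCV` and `LassoCV` pick alpha from an arbitrary log-spaced grid, and scores each model with 5-fold cross-validation. The grid, fold count, and data sizes are assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_preprocessed / y_train
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=8, noise=20.0, random_state=42
)

alphas = np.logspace(-3, 3, 13)  # candidate penalties from 0.001 to 1000

models = {
    "linear": LinearRegression(),
    "ridge": RidgeCV(alphas=alphas),                        # picks alpha internally
    "lasso": LassoCV(alphas=alphas, cv=5, max_iter=10000),  # picks alpha internally
}

results = {}
for name, model in models.items():
    # Outer 5-fold CV: sklearn reports negated MAE, so flip the sign back
    scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[name] = (scores.mean(), scores.std())
    print(f"{name:>6}: MAE = {scores.mean():.2f} +/- {scores.std():.2f}")
```

If the gap between two models is smaller than the fold-to-fold standard deviation, the ranking is probably not reliable. On the real data, the same loop could be run with a `Pipeline` that wraps `preprocessor` and the model, so the preprocessing is refitted inside each fold.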


