The objective of the Salary Prediction project is to create a predictive model that estimates salaries for future job applicants based on a given set of features. The project code is available at this link: Salary Prediction.
Tools Used : Google Colab
Packages : pandas, numpy, matplotlib, seaborn, sklearn
Directories : datasets (train and test data used in the project), images (graphs from the EDA and the models)
Data
train_features.csv: Each row represents an observation. There are 1 million records and 9 features, and the "jobId" column is unique to each observation.
train-salaries.csv: Each row pairs a unique jobId with its corresponding salary. The file is combined with train_features.csv to train the machine learning models.
test_features.csv: Similar to train_features.csv, but no salaries are provided for it; they are predicted by the models.
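Combining the features file with the salaries file could look like the sketch below. It assumes the two files share the jobId key described above; the tiny in-memory frames and their values are made-up stand-ins for the real CSV files, which would normally be loaded with pd.read_csv.

```python
import pandas as pd

# Hypothetical stand-ins for train_features.csv and the salaries file.
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CEO", "Junior", "Janitor"],
    "yearsExperience": [10, 3, 1],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [130, 101, 60],
})

# jobId is unique in both frames, so a one-to-one join is expected;
# validate="one_to_one" raises an error if that assumption is violated.
train = features.merge(salaries, on="jobId", how="inner", validate="one_to_one")
print(train)
```

The validate argument is optional, but it is a cheap guard against duplicated job postings silently inflating the training set.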
Features
jobId : Unique identifier for each job posting.
companyId : Unique identifier for each company posting the job position.
jobType : Type of job position. It contains 8 different categories - CEO, CFO, CTO, Janitor, Junior, Manager, Senior, Vice President.
industry : Job field. It contains 7 different categories - Health, Web, Auto, Finance, Education, Oil, Service.
degree : Highest education obtained. It contains 5 different categories - None, High School, Bachelor's, Master's, Doctoral.
major : Degree major. It contains 9 unique categories - None, Literature, Biology, Chemistry, Physics, Computer Science, Math, Business, Engineering.
yearsExperience : Experience in years.
milesFromMetropolis : Distance from a metropolitan city, in miles.
salary : Target variable. Salary paid, in thousands of US dollars.
EDA
Salary vs Features
degree
industry
jobType
major
milesFromMetropolis
yearsExperience
Correlation Matrix
EDA shows that job type, degree, major, industry, and years of experience are positively associated with salary. Miles from metropolis is negatively correlated with salary, and company is not correlated with it.
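A correlation check of this kind can be sketched as follows for the two numeric features. The data here is synthetic and generated only so that the example runs; the coefficients used to build the salary column are assumptions for illustration, not values from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the project's dataset).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "yearsExperience": rng.integers(0, 25, size=n),
    "milesFromMetropolis": rng.integers(0, 100, size=n),
})
# Assumed relationship: salary rises with experience and falls with distance.
df["salary"] = (
    50
    + 2.0 * df["yearsExperience"]
    - 0.3 * df["milesFromMetropolis"]
    + rng.normal(0, 5, size=n)
)

# Pearson correlation matrix, then each feature's correlation with salary.
corr = df.corr()
salary_corr = corr["salary"].drop("salary").sort_values(ascending=False)
print(salary_corr)
```

Categorical features such as jobType or degree would need a numeric encoding before they can appear in a matrix like this; which encoding the project used is not stated here.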
Baseline Model
Developed a simple baseline model and measured the Mean Squared Error (MSE) against salary for two features, jobType and industry.
MSE with industry : 1367.12
MSE with jobType : 963.93
The goal is to develop models that reduce these MSE values.
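One common way to build such a baseline is to predict the average salary of each category and score that prediction with MSE. The sketch below assumes that approach; the project's actual baseline may differ, and the helper name group_mean_mse and the six sample rows are hypothetical.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Made-up sample rows for illustration only.
df = pd.DataFrame({
    "jobType": ["CEO", "CEO", "Junior", "Junior", "Janitor", "Janitor"],
    "industry": ["Oil", "Web", "Oil", "Web", "Oil", "Web"],
    "salary": [150, 140, 100, 90, 70, 60],
})

def group_mean_mse(frame, column, target="salary"):
    """MSE when each row is predicted by the mean target of its category."""
    predictions = frame.groupby(column)[target].transform("mean")
    return mean_squared_error(frame[target], predictions)

mse_by_feature = {col: group_mean_mse(df, col) for col in ["jobType", "industry"]}
print(mse_by_feature)
```

A feature whose categories separate salaries well gives a lower baseline MSE, which matches the pattern reported above where jobType scores better than industry.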
Models
Linear Regression
Random Forest
Gradient Boosting
Model Evaluation
The table below shows the MSE and R-squared for each model.
The Gradient Boosting Regressor has the lowest MSE and the highest R-squared value, so it is selected to generate predictions for the test set.
The predicted salaries are saved as predicted_salary.csv.
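A comparison of the three model families on MSE and R-squared could be sketched as below with scikit-learn. Everything specific here is an assumption: the data is synthetic, the categorical column is one-hot encoded, a single hold-out split is used, and the hyperparameters are library defaults apart from a reduced tree count. The ranking this toy data produces says nothing about the project's real results.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the merged training data.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "jobType": rng.choice(["Janitor", "Junior", "Senior", "CEO"], size=n),
    "yearsExperience": rng.integers(0, 25, size=n),
    "milesFromMetropolis": rng.integers(0, 100, size=n),
})
job_bonus = df["jobType"].map({"Janitor": 0, "Junior": 20, "Senior": 40, "CEO": 80})
df["salary"] = (
    60 + job_bonus + 2 * df["yearsExperience"]
    - 0.4 * df["milesFromMetropolis"] + rng.normal(0, 10, size=n)
)

X = df.drop(columns="salary")
y = df["salary"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def make_model(estimator):
    """One-hot encode jobType, pass numeric columns through, then fit."""
    encoder = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["jobType"])],
        remainder="passthrough",
    )
    return make_pipeline(encoder, estimator)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, estimator in models.items():
    model = make_model(estimator).fit(X_train, y_train)
    pred = model.predict(X_valid)
    results[name] = {
        "MSE": mean_squared_error(y_valid, pred),
        "R2": r2_score(y_valid, pred),
    }

# One row per model, lowest MSE first.
scores = pd.DataFrame(results).T.sort_values("MSE")
print(scores)
```

Once a model is chosen, its predictions for the test features can be placed in a DataFrame next to jobId and written out with to_csv("predicted_salary.csv", index=False).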
Feature Importance
The bar graph below shows the features in descending order of importance. Years of experience contributes most to the predicted salary for a given posting. Other features such as distance from the metropolis, job type, and degree are also important.
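A ranking like this can be read from a fitted tree ensemble through its feature_importances_ attribute. The sketch below uses synthetic numeric data in which years of experience is deliberately made the strongest driver; the column companyCode is a hypothetical numeric stand-in for an uninformative feature, and none of the numbers come from the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data, built so that experience dominates salary.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "yearsExperience": rng.integers(0, 25, size=n),
    "milesFromMetropolis": rng.integers(0, 100, size=n),
    "companyCode": rng.integers(0, 60, size=n),  # unrelated to salary here
})
y = (
    3.0 * X["yearsExperience"]
    - 0.3 * X["milesFromMetropolis"]
    + rng.normal(0, 5, size=n)
)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Calling importances.plot.bar() would draw the corresponding bar graph with matplotlib.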