min-tee/Salary-Prediction

Salary-Prediction

Define

The objective of the Salary Prediction project is to create a predictive model that estimates salaries for future job applicants based on a given set of features. The project code is available at this link: Salary Prediction.

Discover

Tools Used : Google Colab

Packages : pandas, numpy, matplotlib, seaborn, sklearn

Directories : datasets (train and test data used in the project), images (graphs from the EDA and the models)

Data

train_features.csv: Each row represents an observation. There are 1 million records and 9 features, and the "jobId" column is unique to each observation.

train-salaries.csv: Each row pairs a unique jobId with its corresponding salary. The file is combined with train_features.csv to train the machine learning models.

test_features.csv: Similar to train_features.csv, except that it has no salary values; these are predicted by the models.
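Combining the two training files amounts to a join on jobId. The following is a minimal sketch, assuming pandas and using tiny made-up frames in place of the real CSV files; the sample values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Tiny in-memory stand-ins for train_features.csv and train-salaries.csv.
# In the project these would come from pd.read_csv on the files in datasets/.
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CEO", "Janitor", "Senior"],
    "yearsExperience": [10, 2, 7],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [130, 45, 101],
})

# Inner join on jobId. validate="one_to_one" raises an error if jobId
# is duplicated on either side, which guards the uniqueness assumption.
train = features.merge(salaries, on="jobId", how="inner", validate="one_to_one")
print(train)
```

The `validate` argument is optional, but it turns the "jobId is unique" statement into a check that fails loudly if the data ever violates it.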

Features

jobId : Unique identifier for each job posting.

companyId : Unique identifier for each company posting the job position.

jobType : Type of job position. It contains 8 different categories - CEO, CFO, CTO, Janitor, Junior, Manager, Senior, Vice President.

industry : Job field. It contains 7 different categories - Health, Web, Auto, Finance, Education, Oil, Service.

degree : Highest education obtained. It contains 5 different categories - None, High School, Bachelor's, Master's, Doctoral.

major : Degree major. It contains 9 unique categories - None, Literature, Biology, Chemistry, Physics, Computer Science, Math, Business, Engineering.

yearsExperience : Experience in years.

milesFromMetropolis : Distance from a metropolitan city, in miles.

salary : Target variable. Salary paid, in thousands of US dollars.
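Several of the features above are categorical, and scikit-learn regressors expect numeric input. One common way to handle this is one-hot encoding. The sketch below assumes pandas and uses made-up rows; whether the project encodes its categories this way is an assumption, not something the README states.

```python
import pandas as pd

# Illustrative rows only; category labels follow the lists above.
df = pd.DataFrame({
    "jobType": ["CEO", "Janitor", "CEO"],
    "degree": ["Master's", "None", "Doctoral"],
    "yearsExperience": [12, 1, 20],
})

# Expand each categorical column into one 0/1 indicator column per category.
# Numeric columns such as yearsExperience pass through unchanged.
categorical = ["jobType", "degree"]
encoded = pd.get_dummies(df, columns=categorical, dtype=int)
print(encoded.columns.tolist())
```

Identifier columns such as jobId are normally dropped before modelling, since they carry no predictive signal.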

EDA

Salary Distribution

images

Salary vs Features

degree

images

industry

images

jobType

images

major

images

milesFromMetropolis

images

yearsExperience

images

Correlation Matrix

images

The EDA shows that job type, degree, major, industry, and years of experience are positively associated with salary. Miles from the metropolis is negatively correlated with salary, and company shows no correlation.

Develop

Baseline Model

A simple baseline model was developed and scored with the mean squared error (MSE) between actual salary and the baseline prediction from a single feature, either jobType or industry.

MSE with industry : 1367.12

MSE with jobType : 963.93

The goal is to develop models with lower MSE values.
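One plausible reading of such a baseline is to predict every salary as the mean salary of its category and measure the resulting MSE. The sketch below assumes pandas, uses made-up numbers, and introduces a hypothetical helper `baseline_mse`; it is not the project's actual code and does not reproduce the values reported above.

```python
import pandas as pd

# Made-up rows for illustration; salary is in thousands of US dollars.
train = pd.DataFrame({
    "jobType": ["CEO", "CEO", "Janitor", "Janitor"],
    "industry": ["Oil", "Web", "Oil", "Web"],
    "salary": [150, 130, 60, 40],
})

def baseline_mse(df, feature, target="salary"):
    """MSE when each row is predicted by the mean target of its category."""
    predicted = df.groupby(feature)[target].transform("mean")
    return float(((df[target] - predicted) ** 2).mean())

print(baseline_mse(train, "jobType"))   # 100.0
print(baseline_mse(train, "industry"))  # 2025.0
```

In this toy data jobType separates salaries much better than industry, so its baseline MSE is lower, which mirrors the direction of the reported figures. Note that this sketch scores the baseline on the same rows used to compute the means.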

Models

Linear Regression

images

Random Forest

images

Gradient Boosting

images

Model Evaluation

The table below shows the MSE and R-squared for each model.

images
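A comparison like this can be produced by fitting each model on a training split and scoring it on held-out data. The following is a minimal sketch, assuming scikit-learn and synthetic data; the data, split ratio, and hyperparameters are illustrative assumptions and the scores it prints are unrelated to the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two numeric features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 24, size=(500, 2))
y = 60 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 5, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# Fit each model and record MSE and R-squared on the held-out split.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results[name] = {
        "mse": mean_squared_error(y_val, pred),
        "r2": r2_score(y_val, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.2f}, R2={results[name]['r2']:.3f}")
```

The two metrics are linked: on a given evaluation set, R-squared equals 1 minus MSE divided by the variance of the target, so the model with the lowest MSE also has the highest R-squared.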

Deploy

The Gradient Boosting Regressor has the lowest MSE and the highest R-squared value, so it is selected to generate predictions for the test set.

The predicted salaries are saved as predicted_salary.csv.

images
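The deployment step can be sketched as fitting the chosen model on the training data, predicting on the test features, and writing the result to CSV. The example below assumes scikit-learn and pandas and uses synthetic data; the output layout (a jobId column next to a salary column) and the temporary output directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic training data with two numeric features (illustrative only).
rng = np.random.default_rng(1)
train = pd.DataFrame({
    "yearsExperience": rng.integers(0, 25, 200),
    "milesFromMetropolis": rng.integers(0, 100, 200),
})
train["salary"] = 70 + 2 * train["yearsExperience"] - 0.4 * train["milesFromMetropolis"]

# Test rows have the same features but no salary column.
test = pd.DataFrame({
    "jobId": ["JOB_A", "JOB_B", "JOB_C"],
    "yearsExperience": [3, 12, 20],
    "milesFromMetropolis": [80, 40, 5],
})

feature_cols = ["yearsExperience", "milesFromMetropolis"]
model = GradientBoostingRegressor(random_state=0)
model.fit(train[feature_cols], train["salary"])

# Pair each jobId with its predicted salary and write the file.
predictions = pd.DataFrame({
    "jobId": test["jobId"],
    "salary": model.predict(test[feature_cols]),
})
out_path = os.path.join(tempfile.mkdtemp(), "predicted_salary.csv")
predictions.to_csv(out_path, index=False)
```

Passing `index=False` keeps the pandas row index out of the file, so it contains only the two named columns.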

Feature Importance

The bar graph below shows the features in descending order of importance. Years of experience contributes the most to the salary prediction for a given position. Other features, such as distance from the metropolis, job type, and degree, are also important.

images
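A ranking like this can be read from the fitted model's `feature_importances_` attribute. The sketch below assumes scikit-learn and synthetic data in which yearsExperience dominates the target by construction; the `noise` column is a hypothetical uninformative feature added for contrast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends strongly on yearsExperience,
# weakly on milesFromMetropolis, and not at all on noise.
rng = np.random.default_rng(2)
X = pd.DataFrame({
    "yearsExperience": rng.uniform(0, 24, 400),
    "milesFromMetropolis": rng.uniform(0, 99, 400),
    "noise": rng.normal(size=400),
})
y = 3.0 * X["yearsExperience"] - 0.2 * X["milesFromMetropolis"]

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.bar()` on the sorted series would draw a bar graph in descending order, provided matplotlib is installed.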
