Salary-Prediction

Define

The objective of the Salary Prediction project is to create a predictive model that estimates salaries for future job applicants based on a given set of features. The project code is available at this link: Salary Prediction.

Discover

Tools Used : Google Colab

Packages : pandas, numpy, matplotlib, seaborn, sklearn

Directories : datasets - train and test data used in the project; images - graphs from the EDA and the models

Data

train_features.csv: Each row represents an observation. There are 1 million records and 9 features, and the "jobId" column is unique to each observation.

train-salaries.csv: Each row contains a unique jobId and its corresponding salary. The file is merged with train_features.csv to train the machine learning models.

test_features.csv: Similar to train_features.csv, except that the salary is not provided and is instead predicted by the models.
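As a rough illustration of how the two training files can be combined, the sketch below joins two small in-memory frames on jobId with pandas. The rows are made-up stand-ins for the real CSV files, and the use of an inner join with `validate="one_to_one"` is an assumption, not necessarily what the notebook does.

```python
import pandas as pd

# Tiny stand-ins for train_features.csv and train-salaries.csv (illustrative values only).
features = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "jobType": ["CEO", "Janitor", "Senior"],
    "yearsExperience": [10, 3, 8],
})
salaries = pd.DataFrame({
    "jobId": ["JOB1", "JOB2", "JOB3"],
    "salary": [130, 60, 110],
})

# Inner join on jobId; validate raises an error if jobId is duplicated on either side.
train = features.merge(salaries, on="jobId", how="inner", validate="one_to_one")
print(train)
```

With the real data, the two frames would come from `pd.read_csv` on the files in the datasets directory.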

Features

jobId : Unique identifier for each job posting.

companyId : Unique identifier for each company posting the job position.

jobType : Type of job position. It contains 8 different categories - CEO, CFO, CTO, Janitor, Junior, Manager, Senior, Vice President.

industry : Job field. It contains 7 different categories - Health, Web, Auto, Finance, Education, Oil, Service.

degree : Highest education obtained. It contains 5 different categories - None, High School, Bachelor's, Master's, Doctoral.

major : Degree major. It contains 9 unique categories - None, Literature, Biology, Chemistry, Physics, Computer Science, Math, Business, Engineering.

yearsExperience : Experience in years.

milesFromMetropolis : Distance from the metropolitan city, in miles.

salary : Target variable. Salary paid, in thousands of US dollars.

EDA

Salary Distribution

Salary vs Features

degree

industry

jobType

major

milesFromMetropolis

yearsExperience

Correlation Matrix

EDA shows that the features job type, degree, major, industry, and years of experience have a positive effect on salary. Miles from metropolis is negatively correlated with salary, and company is not correlated with it.
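A correlation matrix like the one described here can be computed with pandas. The sketch below uses a handful of invented rows for the two numeric features. Categorical features such as jobType would first need a numeric encoding, which the source does not specify.

```python
import pandas as pd

# Invented rows for illustration; the real values come from the merged training data.
df = pd.DataFrame({
    "yearsExperience": [1, 5, 10, 15, 20],
    "milesFromMetropolis": [90, 75, 40, 35, 10],
    "salary": [60, 85, 110, 135, 160],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr["salary"])
```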

Develop

Baseline Model

Developed a simple baseline model and measured its Mean Squared Error (MSE) when predicting salary from a single feature, either jobType or industry.

MSE with industry : 1367.12

MSE with jobType : 963.93

The goal is to develop models with lower MSE values.
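The source does not say how the baseline makes its predictions. One common choice, assumed here, is to predict each row's salary as the mean salary of its category. The helper name `baseline_mse` and the sample rows are hypothetical.

```python
import pandas as pd

# Invented rows for illustration only.
train = pd.DataFrame({
    "jobType": ["CEO", "CEO", "Janitor", "Janitor"],
    "salary": [140, 160, 60, 80],
})

def baseline_mse(df, feature, target="salary"):
    """MSE when each row is predicted by the mean target of its feature group."""
    predicted = df.groupby(feature)[target].transform("mean")
    return float(((df[target] - predicted) ** 2).mean())

print(baseline_mse(train, "jobType"))  # 100.0
```

Calling the same helper with `"industry"` on the full training data would give the second baseline figure under this assumption.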

Models

Linear Regression

Random Forest

Gradient Boosting
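A minimal sketch of comparing these three model families with scikit-learn is shown below. The synthetic data, the hyperparameters, and the use of 3-fold cross-validation are all assumptions made for illustration. The notebook's actual settings are not given in this README.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic numeric data standing in for the encoded training features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 50 + 40 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 1, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# scikit-learn reports negated MSE so that higher is better; flip the sign back.
cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()

for name, value in cv_mse.items():
    print(f"{name}: MSE = {value:.2f}")
```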

Model Evaluation

The table below shows the MSE and R-squared for each model.
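For reference, the two metrics can be computed directly from their definitions. The sketch below uses NumPy and four invented salary values. The function names `mse` and `r_squared` are illustrative, and scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual sum of squares to total sum of squares."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = [100, 120, 140, 160]
y_pred = [110, 110, 150, 150]
print(mse(y_true, y_pred))        # 100.0
print(r_squared(y_true, y_pred))  # 0.8
```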

Deploy

Gradient Boosting Regressor has the lowest MSE and the highest R-squared value, so it is selected for deployment on the test set.

The predicted salaries are saved as predicted_salary.csv.
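Writing the predictions out might look like the sketch below. The column layout (jobId alongside salary) is an assumption, and the IDs and values are placeholders rather than real model output.

```python
import pandas as pd

# Placeholder IDs and predictions; in practice these come from test_features.csv
# and the fitted model's predict() output.
test_ids = ["JOB10", "JOB11"]
predictions = [101.5, 88.0]

out = pd.DataFrame({"jobId": test_ids, "salary": predictions})
out.to_csv("predicted_salary.csv", index=False)

# Read the file back as a quick sanity check.
check = pd.read_csv("predicted_salary.csv")
print(check)
```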

Feature Importance

The bar graph below shows the important features in descending order. Years of experience contributes most to predicting the salary for a given post. Other features, such as distance from the metropolis, job type, and degree, are also important.
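A ranking like this can be read from a fitted scikit-learn gradient boosting model through its `feature_importances_` attribute. The sketch below uses synthetic data in which an extra `noise` column, added purely for illustration, has no effect on the target. The coefficients and sample size are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: salary depends on two features; "noise" is unrelated to it.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "yearsExperience": rng.integers(0, 25, size=300),
    "milesFromMetropolis": rng.integers(0, 100, size=300),
    "noise": rng.normal(size=300),
})
y = 60 + 2.0 * X["yearsExperience"] - 0.4 * X["milesFromMetropolis"] + rng.normal(0, 2, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort so the largest comes first.
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)
```

Calling `importance.plot.bar()` would produce a bar graph of the sorted values, provided matplotlib is installed.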
