Job Salaries Estimator for Different Data Science Positions

Designed a tool that estimates data science salaries to help data scientists negotiate their pay when applying for data science jobs.
Scraped job descriptions from glassdoor.com using Python and Selenium.
Developed Linear, Lasso, and Random Forest regressors and used GridSearchCV to find the best model.
Deployed the machine learning model on Heroku using Flask.

Links and Resources Used

Web Scraper Article: https://towardsdatascience.com/selenium-tutorial-scraping-glassdoor-com-in-10-minutes-3d0915c6d905
Web Scraper Github: https://github.com/arapfaik/scraping-glassdoor-selenium
Model Deployment Video: https://www.youtube.com/watch?v=mrExsjcvF4o&feature=youtu.be
Model Deployment Github: https://github.com/krishnaik06/Heroku-Demo
Packages: pandas, numpy, sklearn, matplotlib, seaborn, selenium, flask, json, pickle

Web Scraping

Used the web scraper GitHub repo (above) to scrape job postings from glassdoor.com. For each job, I obtained the following:

Job Title
Salary Estimate
Job Description
Rating
Company Name
Location
Headquarters
Size
Founded
Type of Ownership
Industry
Sector
Revenue
Competitors

Data Cleaning

After scraping the data, I needed to clean it so that it was usable for the model. I made the following modifications and created the following variables:

Parsed the numeric data out of Salary
Made separate columns for employer-provided salary and hourly wages
Removed the rows without a Salary value, since a few were empty
Parsed the rating out of the company text
Made a separate column for the company state
Made a new column to check whether the job is at the company’s headquarters
Added a company age column derived from the founded date
Made new columns for simplified job title and seniority
Made a new column for description length
Added columns to check whether the following skills were listed in the job description:

Python
R
Excel
AWS
Apache Spark
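
A few of the cleaning steps above can be sketched in plain Python. This is only an illustration: the helper names (`parse_salary`, `skill_flags`, `state_of`, `company_age`) are hypothetical, and the salary string formats are assumptions about what the scraped fields look like, not code taken from this repository.

```python
import re

def parse_salary(text):
    """Pull the numeric bounds and two flags out of a salary string.

    Returns None for an empty value, which corresponds to dropping the row.
    """
    if not text or not re.search(r"\d", text):
        return None
    lowered = text.lower()
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    low, high = numbers[0], numbers[-1]
    return {
        "hourly": int("per hour" in lowered),
        "employer_provided": int("employer provided" in lowered),
        "min_salary": low,
        "max_salary": high,
        "avg_salary": (low + high) / 2,
    }

# Rough keyword heuristics (assumed, not the project's exact rules).
# "R" is matched case-sensitively as a standalone word to limit false hits.
SKILL_PATTERNS = {
    "python": re.compile(r"\bpython\b", re.IGNORECASE),
    "r": re.compile(r"\bR\b"),
    "excel": re.compile(r"\bexcel\b", re.IGNORECASE),
    "aws": re.compile(r"\baws\b", re.IGNORECASE),
    "spark": re.compile(r"\bspark\b", re.IGNORECASE),
}

def skill_flags(description):
    """Return a 0/1 flag per skill depending on whether it appears in the text."""
    return {
        name: int(bool(pattern.search(description)))
        for name, pattern in SKILL_PATTERNS.items()
    }

def state_of(location):
    """Take the part after the last comma, e.g. 'City, ST' gives 'ST'."""
    return location.split(",")[-1].strip()

def company_age(founded_year, current_year):
    """Age in years, or None when the founding year is missing or invalid."""
    if not founded_year or founded_year <= 0:
        return None
    return current_year - founded_year
```

For example, `parse_salary("$53K-$91K (Glassdoor est.)")` yields a minimum of 53, a maximum of 91, and an average of 72.0, with both flags set to 0. Keyword matching like this is deliberately simple, so a word such as "excel" used as a verb would still be counted.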

Exploratory Data Analysis

EDA plays an important role at this stage: summarizing the cleaned data helps identify its structure, outliers, anomalies, and patterns. I looked at the distributions of the data and the value counts for the various categorical variables. I carried out univariate and bivariate analysis and used histograms, box plots, bar graphs, and pivot tables to represent the data.
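
As a rough illustration of the value counts and pivot tables mentioned above, here is a small pandas sketch. The rows are toy values made up for the example, and the column names (`job_simp`, `seniority`, `avg_salary`) are assumptions rather than the names used in the notebooks.

```python
import pandas as pd

# Toy rows for illustration only; they are not taken from the scraped dataset.
df = pd.DataFrame({
    "job_simp": ["data scientist", "data scientist", "analyst", "analyst"],
    "seniority": ["senior", "na", "na", "senior"],
    "avg_salary": [140.0, 110.0, 60.0, 80.0],
})

# Univariate view: how often each category occurs, plus summary statistics.
title_counts = df["job_simp"].value_counts()
salary_summary = df["avg_salary"].describe()

# Bivariate view: mean salary for each job title and seniority level.
salary_pivot = pd.pivot_table(
    df,
    index="job_simp",
    columns="seniority",
    values="avg_salary",
    aggfunc="mean",
)

print(title_counts)
print(salary_summary)
print(salary_pivot)
```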

Model Building

First, I converted all the categorical variables into dummy variables. Then I split the data into training and test sets with a test size of 20%. I tried three different models and evaluated them using Mean Absolute Error (MAE). I chose MAE because it is relatively easy to interpret (it is in the same units as the salary) and outliers aren’t particularly bad for this type of model.

Multiple Linear Regression: the baseline model.
Lasso Regression: because the data is sparse, with many 0s and 1s from the many categorical variables, I expected a regularized regression like Lasso to be effective.
Random Forest Regression: given the sparsity of the data, I expected this to be a good fit.
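
The workflow described in this section (dummy variables, an 80/20 split, three candidate models, MAE on the test set) might look roughly like the sketch below. It runs on synthetic stand-in data, and the column names and hyperparameter grids are arbitrary assumptions, so the numbers it produces are unrelated to the results reported for this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned dataset (assumed columns, made-up values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "rating": rng.uniform(1, 5, n),
    "python": rng.integers(0, 2, n),
    "sector": rng.choice(["tech", "finance", "health"], n),
})
df["avg_salary"] = 60 + 10 * df["rating"] + 15 * df["python"] + rng.normal(0, 5, n)

# Turn categorical columns into dummy variables, then hold out 20% for testing.
X = pd.get_dummies(df.drop(columns="avg_salary"))
y = df["avg_salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Each candidate pairs an estimator with a small, arbitrary parameter grid.
candidates = {
    "linear": (LinearRegression(), {}),
    "lasso": (Lasso(), {"alpha": [0.01, 0.1, 1.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

test_mae = {}
for name, (estimator, grid) in candidates.items():
    # Cross-validated search on the training set, scored by (negative) MAE.
    search = GridSearchCV(estimator, grid, scoring="neg_mean_absolute_error", cv=3)
    search.fit(X_train, y_train)
    # The refit best estimator is then evaluated once on the held-out test set.
    test_mae[name] = mean_absolute_error(y_test, search.predict(X_test))

print(test_mae)
```

Scoring with `neg_mean_absolute_error` keeps the quantity optimized during the search consistent with the MAE used for the final comparison.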

Model Performance

The Random Forest model performed better than the other two models on the test set.

Linear Regression MAE: 18.885
Lasso Regression MAE: 19.665
Random Forest Regression MAE: 11.142

Model Deployment

I deployed the model on Heroku, a Platform as a Service (PaaS), using the Flask framework.
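
A minimal sketch of what a Flask prediction endpoint for such a model could look like is shown below. It is not the repository's app.py: the `/predict` route, the JSON payload shape, and the `PlaceholderModel` class are all assumptions, and a real app would instead load the trained regressor from its pickle file at startup.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class PlaceholderModel:
    """Stand-in with a scikit-learn style predict(); not the trained model."""

    def predict(self, rows):
        # Always returns the same value so the sketch runs without a model file.
        return [100.0 for _ in rows]

# Assumption: the real app would do something like pickle.load(open(<file>, "rb")).
model = PlaceholderModel()

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    features = payload.get("features")
    if not isinstance(features, list) or not features:
        return jsonify(error="expected a non-empty 'features' list"), 400
    # The model expects a 2D input: one row per job posting.
    prediction = model.predict([features])[0]
    return jsonify(predicted_salary=float(prediction))
```

The endpoint can be exercised without starting a server through Flask's test client, for example `app.test_client().post("/predict", json={"features": [1, 2, 3]})`. On a platform such as Heroku, a WSGI server named in the Procfile would import `app` and serve it.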

Web application: https://glassdoorsalaryprediction-api.herokuapp.com/
