Salary machine learning project

This project presents an end-to-end machine learning solution to the problem of predicting salaries from job offers posted on LinkedIn.

You can access the online app that monitors the model hosted on AWS: https://salary-ml-project.streamlit.app/

This project consists of the following steps:

  • Problem definition: Predicting the salary of job offers posted on LinkedIn
  • Data collection: Web-scraping script for the LinkedIn website [python]
  • Data preprocessing and feature engineering: Extracting numbers from the job description to estimate the salary (a sketch of this step is shown after this list) [pandas]
  • Model selection, training, and fine-tuning: Selecting base models with good performance and fast training out of 15 model classes, then training more than 100 models with Bayesian optimization to fine-tune the hyperparameters [scikit-learn, mlflow]
  • Model evaluation: Comparing model performance with MLflow [mlflow]
  • Model deployment: Local deployment using MLflow and online deployment on AWS using FastAPI [mlflow, fastapi]
  • Model monitoring: Local monitoring of data quality, data drift, and model performance; online monitoring of API latency and prediction errors [evidently]
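
For illustration, here is a minimal sketch of what the salary-extraction step could look like: a regular expression pulls dollar amounts out of the job description using pandas. The regex and the column names (description, salary_estimate) are assumptions made for this example, not the exact logic of the preprocessing scripts in this repository.

    # Minimal sketch of salary extraction from job descriptions.
    # The column names and the regex are illustrative assumptions.
    import re
    from typing import Optional

    import pandas as pd

    SALARY_RE = re.compile(r"\$\s?(\d{2,3}(?:,\d{3})+|\d{2,3}\s?[kK])")

    def extract_salary(description: str) -> Optional[float]:
        """Return the first dollar amount found in a job description, in USD."""
        match = SALARY_RE.search(description or "")
        if match is None:
            return None
        raw = match.group(1).lower().replace(",", "").replace(" ", "")
        return float(raw.rstrip("k")) * 1000 if raw.endswith("k") else float(raw)

    jobs = pd.DataFrame({"description": [
        "Data Analyst - $85,000 to $95,000 per year",
        "Data Engineer, salary range $120k-$140k",
        "Data Scientist, competitive salary",
    ]})
    jobs["salary_estimate"] = jobs["description"].apply(extract_salary)
    print(jobs)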

(Pipeline overview diagram: salary_ml)

How to use this repository:

  1. Clone the repository
  2. Install the dependencies:
    pip install -r requirements.txt
    
  3. Train a new model:
    python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
    
  4. Run the MLflow server to select the best model:
    mlflow server --host 127.0.0.1 --port 8080
    
  5. Serve the model locally (pick the MLflow model you want to serve; an example request is shown after these steps):
    mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
    
  6. Monitor the model:
    python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
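
Once step 5 is running, the model is served by MLflow's standard scoring server, which exposes a POST /invocations endpoint on port 8001. The example below sends one record to it with requests (payload format for MLflow 2.x); the feature names are placeholders and must match the schema the model was actually trained on.

    # Query the locally served MLflow model (step 5).
    # The feature columns below are placeholders, not the real training schema.
    import requests

    payload = {
        "dataframe_records": [
            {"title": "data scientist", "location": "New York", "seniority": "mid"}
        ]
    }
    response = requests.post(
        "http://127.0.0.1:8001/invocations",
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    print("Predicted salary:", response.json())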
    

Instructions for end-to-end pipeline

The full pipeline can take some time to run, especially the data generation step, depending on the value of --n_queries:

  1. Generate data:
    python src/data/generate_data.py --queries "data analyst, data scientist, data engineer" --out data/train.h5 --n_queries 10
    
  2. Prepare the data:
    python src/features/build_dataset.py --in data/train.h5 --out data/train.zip
    
  3. Train a new model:
    python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
    
  4. Run the MLflow server to select the best model:
    mlflow server --host 127.0.0.1 --port 8080
    
  5. Serve the model locally (pick the MLflow model you want to serve):
    mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
    
  6. Monitor the model (a sketch of the underlying drift check follows these steps):
    python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
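
Step 6 runs the Evidently-based monitoring script. As a rough sketch of the kind of data-drift check it performs (assuming an Evidently version that provides Report and DataDriftPreset, and that each .zip file holds a single CSV; the actual script may differ), one could write:

    # Rough sketch of a data-drift check with Evidently.
    # Assumes an Evidently release exposing Report / DataDriftPreset and
    # that data/train.zip and data/test.zip each contain a single CSV file.
    import pandas as pd
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    reference = pd.read_csv("data/train.zip")  # training data as the reference
    current = pd.read_csv("data/test.zip")     # newer data to compare against it

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html("reports/data_drift.html")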
    
