This project presents an end-to-end machine learning solution to the problem of predicting salaries from job offers posted on LinkedIn.
You can access the online app to monitor the model hosted on AWS: https://salary-ml-project.streamlit.app/
This project consists of the following steps:
- Problem definition: predicting the salary of job offers on LinkedIn
- Data collection: web-scraping script for the LinkedIn website [Python]
- Data preprocessing and feature engineering: extracting numbers from the job description to estimate the salary [pandas]
- Model selection, training, and fine-tuning: selecting base models with good performance and fast training out of 15 model classes, then training more than 100 models with Bayesian optimization to fine-tune hyperparameters [scikit-learn, MLflow]
- Model evaluation: comparing model performance in MLflow [MLflow]
- Model deployment: local deployment with MLflow and online deployment on AWS with FastAPI [MLflow, FastAPI]
- Model monitoring: local monitoring of data quality, data drift, and model performance; online monitoring of API latency and prediction errors [Evidently]
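The feature-engineering step above (extracting numbers from the job description to estimate the salary) can be sketched as follows. This is a minimal illustration, not the project's actual code; the regular expression and function names are assumptions:

```python
from __future__ import annotations

import re

# Hypothetical sketch: pull dollar amounts such as "$90,000" or "$110,000"
# out of a job description and turn them into a salary estimate.
SALARY_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d+)?)")

def extract_salary_candidates(description: str) -> list[float]:
    """Return all dollar amounts found in a job description."""
    return [float(m.replace(",", "")) for m in SALARY_RE.findall(description)]

def estimate_salary(description: str) -> float | None:
    """Midpoint of the smallest and largest amounts found, or None."""
    amounts = extract_salary_candidates(description)
    if not amounts:
        return None
    return (min(amounts) + max(amounts)) / 2
```

A real pipeline would also need to handle hourly rates, ranges written in words, and currencies other than dollars.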
To run the project locally:

- Clone the repository
- Install the dependencies:

```
pip install -r requirements.txt
```

- Train a new model:

```
python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
```

- Run the MLflow server to select the best model:

```
mlflow server --host 127.0.0.1 --port 8080
```

- Serve the model locally (point `-m` to the MLflow model you want to serve):

```
mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
```

- Monitor the model:

```
python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
```
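Once the model is served locally, it can be queried over HTTP. MLflow 2.x scoring servers expose a `/invocations` endpoint that accepts a JSON body in the `dataframe_split` format; the column names below are illustrative assumptions about the model's input features, not the project's actual schema:

```python
import json
import urllib.request

def build_invocation_payload(columns, rows):
    """Build the JSON body expected by MLflow's /invocations endpoint
    (the "dataframe_split" input format of MLflow 2.x)."""
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})

def predict(payload, url="http://127.0.0.1:8001/invocations"):
    """POST the payload to the locally served model and return its response.
    Requires `mlflow models serve` to be running on port 8001."""
    req = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical feature columns; replace with what your model actually expects.
payload = build_invocation_payload(
    ["title", "location"], [["data scientist", "Paris"]]
)
```

Calling `predict(payload)` then returns the model's salary prediction as JSON.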
The full pipeline can take some time to run, especially the data generation step, depending on `--n_queries`:

- Generate data:

```
python src/data/generate_data.py --queries "data analyst, data scientist, data engineer" --out data/train.h5 --n_queries 10
```

- Prepare the data:

```
python src/features/build_dataset.py --in data/train.h5 --out data/train.zip
```

- Train a new model:

```
python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
```

- Run the MLflow server to select the best model:

```
mlflow server --host 127.0.0.1 --port 8080
```

- Serve the model locally (point `-m` to the MLflow model you want to serve):

```
mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
```

- Monitor the model:

```
python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
```
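The monitoring step compares the training and test data to detect data drift. As an illustration of the underlying idea (not the Evidently API, which the project actually uses), here is a minimal two-sample Kolmogorov-Smirnov check in plain Python:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return sum(v <= x for v in sorted_sample) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in values)

def drifted(reference, current, threshold=0.2):
    """Flag drift when the KS statistic exceeds a (hypothetical) threshold."""
    return ks_statistic(reference, current) > threshold
```

Evidently wraps tests like this one (per feature, with sensible defaults) into reports and dashboards, which is what `monitor_local_model.py` relies on.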