This project presents an end-to-end machine learning solution to the problem of predicting salaries from job offers posted on LinkedIn.
You can access the online app to monitor the model hosted on AWS: https://salary-ml-project.streamlit.app/
This project consists of the following steps:
- Problem definition: predicting the salary of job offers on LinkedIn
- Data collection: web-scraping script for the LinkedIn website [Python]
- Data preprocessing and feature engineering: extracting numbers from the job description to estimate the salary [pandas]
- Model selection, training, and fine-tuning: selecting base models with good performance and fast training out of 15 model classes, then training more than 100 models with Bayesian optimization to fine-tune hyperparameters [scikit-learn, MLflow]
- Model evaluation: comparing model performance in MLflow [MLflow]
- Model deployment: local deployment with MLflow and online deployment on AWS with FastAPI [MLflow, FastAPI]
- Model monitoring: local monitoring of data quality, data drift, and model performance; online monitoring of API latency and prediction errors [Evidently]
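The feature-engineering step above (extracting numbers from the job description to estimate the salary) can be sketched as follows. This is a minimal illustration, not the project's actual code; the regular expression and function names are assumptions:

```python
from __future__ import annotations

import re

# Hypothetical sketch: pull dollar amounts such as "$90,000" or "$110,000"
# out of a job description and turn them into a salary estimate.
SALARY_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})*(?:\.\d+)?)")

def extract_salary_candidates(description: str) -> list[float]:
    """Return all dollar amounts found in a job description."""
    return [float(m.replace(",", "")) for m in SALARY_RE.findall(description)]

def estimate_salary(description: str) -> float | None:
    """Midpoint of the smallest and largest amounts found, or None."""
    amounts = extract_salary_candidates(description)
    if not amounts:
        return None
    return (min(amounts) + max(amounts)) / 2
```

A real pipeline would also need to handle hourly rates, ranges written in words, and currencies other than dollars.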
To run the project locally:

- Clone the repository
- Install the dependencies:

```
pip install -r requirements.txt
```

- Train a new model:

```
python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
```

- Run the MLflow server to select the best model:

```
mlflow server --host 127.0.0.1 --port 8080
```

- Serve the model locally (point `-m` to the MLflow model you want to serve):

```
mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
```

- Monitor the model:

```
python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
```
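Once the model is served locally, it can be queried over HTTP. MLflow 2.x scoring servers expose a `/invocations` endpoint that accepts a JSON body in the `dataframe_split` format; the column names below are illustrative assumptions about the model's input features, not the project's actual schema:

```python
import json
import urllib.request

def build_invocation_payload(columns, rows):
    """Build the JSON body expected by MLflow's /invocations endpoint
    (the "dataframe_split" input format of MLflow 2.x)."""
    return json.dumps({"dataframe_split": {"columns": columns, "data": rows}})

def predict(payload, url="http://127.0.0.1:8001/invocations"):
    """POST the payload to the locally served model and return its response.
    Requires `mlflow models serve` to be running on port 8001."""
    req = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Hypothetical feature columns; replace with what your model actually expects.
payload = build_invocation_payload(
    ["title", "location"], [["data scientist", "Paris"]]
)
```

Calling `predict(payload)` then returns the model's salary prediction as JSON.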
The full pipeline can take some time to run, especially the data generation step, depending on `--n_queries`:

- Generate data:

```
python src/data/generate_data.py --queries "data analyst, data scientist, data engineer" --out data/train.h5 --n_queries 10
```

- Prepare the data:

```
python src/features/build_dataset.py --in data/train.h5 --out data/train.zip
```

- Train a new model:

```
python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
```

- Run the MLflow server to select the best model:

```
mlflow server --host 127.0.0.1 --port 8080
```

- Serve the model locally (point `-m` to the MLflow model you want to serve):

```
mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
```

- Monitor the model:

```
python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
```
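The monitoring step compares the training and test data to detect data drift. As an illustration of the underlying idea (not the Evidently API, which the project actually uses), here is a minimal two-sample Kolmogorov-Smirnov check in plain Python:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    values = sorted(set(a) | set(b))

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return sum(v <= x for v in sorted_sample) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in values)

def drifted(reference, current, threshold=0.2):
    """Flag drift when the KS statistic exceeds a (hypothetical) threshold."""
    return ks_statistic(reference, current) > threshold
```

Evidently wraps tests like this one (per feature, with sensible defaults) into reports and dashboards, which is what `monitor_local_model.py` relies on.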