Salary machine learning project

This project presents an end-to-end machine learning solution to the problem of predicting salaries from job offers posted on LinkedIn.

You can access the online app that monitors the model hosted on AWS: https://salary-ml-project.streamlit.app/

This project consists of the following steps:

  • Problem definition: Predicting the salary of job offers posted on LinkedIn
  • Data collection: Web-scraping script for the LinkedIn website [python]
  • Data preprocessing and feature engineering: Extracting numbers from the job description to estimate the salary (a sketch of this step is shown after this list) [pandas]
  • Model selection, training, and fine-tuning: Selecting base models with good performance and fast training out of 15 model classes, then training more than 100 models with Bayesian optimization to fine-tune the hyperparameters [scikit-learn, mlflow]
  • Model evaluation: Comparing model performance with MLflow [mlflow]
  • Model deployment: Local deployment using MLflow and online deployment on AWS using FastAPI [mlflow, fastapi]
  • Model monitoring: Local monitoring of data quality, data drift, and model performance; online monitoring of API latency and prediction errors [evidently]
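
For illustration, here is a minimal sketch of what the salary-extraction step could look like: a regular expression pulls dollar amounts out of the job description using pandas. The regex and the column names (description, salary_estimate) are assumptions made for this example, not the exact logic of the preprocessing scripts in this repository.

    # Minimal sketch of salary extraction from job descriptions.
    # The column names and the regex are illustrative assumptions.
    import re
    from typing import Optional

    import pandas as pd

    SALARY_RE = re.compile(r"\$\s?(\d{2,3}(?:,\d{3})+|\d{2,3}\s?[kK])")

    def extract_salary(description: str) -> Optional[float]:
        """Return the first dollar amount found in a job description, in USD."""
        match = SALARY_RE.search(description or "")
        if match is None:
            return None
        raw = match.group(1).lower().replace(",", "").replace(" ", "")
        return float(raw.rstrip("k")) * 1000 if raw.endswith("k") else float(raw)

    jobs = pd.DataFrame({"description": [
        "Data Analyst - $85,000 to $95,000 per year",
        "Data Engineer, salary range $120k-$140k",
        "Data Scientist, competitive salary",
    ]})
    jobs["salary_estimate"] = jobs["description"].apply(extract_salary)
    print(jobs)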

(Pipeline overview diagram: salary_ml)

How to use this repository:

  1. Clone the repository
  2. Install the dependencies:
    pip install -r requirements.txt
    
  3. Train a new model:
    python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
    
  4. Run the MLflow server to select the best model:
    mlflow server --host 127.0.0.1 --port 8080
    
  5. Serve the model locally (pick the MLflow model you want to serve; an example request is shown after these steps):
    mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
    
  6. Monitor the model:
    python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
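
Once step 5 is running, the model is served by MLflow's standard scoring server, which exposes a POST /invocations endpoint on port 8001. The example below sends one record to it with requests (payload format for MLflow 2.x); the feature names are placeholders and must match the schema the model was actually trained on.

    # Query the locally served MLflow model (step 5).
    # The feature columns below are placeholders, not the real training schema.
    import requests

    payload = {
        "dataframe_records": [
            {"title": "data scientist", "location": "New York", "seniority": "mid"}
        ]
    }
    response = requests.post(
        "http://127.0.0.1:8001/invocations",
        json=payload,
        timeout=10,
    )
    response.raise_for_status()
    print("Predicted salary:", response.json())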
    

Instructions for end-to-end pipeline

The full pipeline can take some time to run, especially the data generation step, depending on the value of --n_queries:

  1. Generate data:
    python src/data/generate_data.py --queries "data analyst, data scientist, data engineer" --out data/train.h5 --n_queries 10
    
  2. Prepare the data:
    python src/features/build_dataset.py --in data/train.h5 --out data/train.zip
    
  3. Train a new model:
    python src/train/train_model.py --in data/linkedin_jobs.csv --n_eval 25
    
  4. Run the MLflow server to select the best model:
    mlflow server --host 127.0.0.1 --port 8080
    
  5. Serve the model locally (pick the MLflow model you want to serve):
    mlflow models serve -m {path to mlflow model} -h 127.0.0.1 -p 8001 --env-manager=local
    
  6. Monitor the model (a sketch of the underlying drift check follows these steps):
    python src/monitor/monitor_local_model.py --train_file data/train.zip --test_file data/test.zip
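
Step 6 runs the Evidently-based monitoring script. As a rough sketch of the kind of data-drift check it performs (assuming an Evidently version that provides Report and DataDriftPreset, and that each .zip file holds a single CSV; the actual script may differ), one could write:

    # Rough sketch of a data-drift check with Evidently.
    # Assumes an Evidently release exposing Report / DataDriftPreset and
    # that data/train.zip and data/test.zip each contain a single CSV file.
    import pandas as pd
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    reference = pd.read_csv("data/train.zip")  # training data as the reference
    current = pd.read_csv("data/test.zip")     # newer data to compare against it

    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    report.save_html("reports/data_drift.html")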
    
