# MLOps

"MLOps helps drive business value by fast-tracking the experimentation process and development pipeline, improving the quality of model production—and makes it easier to monitor and maintain production models and manage regulatory requirements." - [Deloitte](https://www2.deloitte.com/us/en/insights/focus/tech-trends/2021/mlops-industrialized-ai.html)

"Relying on a few experts has limits, chiefly related to scalability and repeatability. Every data Jedi typically prefers their own set of model development and deployment workflows, based on education, experience, and personal preferences. They then often build models with bespoke data extracts that can require significant effort to recreate when later brought into a production setting using real-world, large-scale data. As machine learning permeates the enterprise, a more scalable, efficient, and faster approach is needed to improve development resilience, reduce production bottlenecks, and increase the reach of ML projects." - [Deloitte](https://www2.deloitte.com/us/en/insights/focus/tech-trends/2021/mlops-industrialized-ai.html)

## Purpose / Goals

### Reproducibility / Version control

It should be trivial to view the individual components (code and dependencies) of a release from a year ago and to re-deploy that exact version into production.

It should also be possible to reconstruct a machine learning model from a year ago (to within a few percentage points of accuracy). This requires traceability covering all of the inputs:

- the dataset that was used
- the version of the machine learning framework
- the code commit
- the dependencies/packages
- the driver version and low level libraries like CUDA and cuDNN
- the container or runtime
- the parameters used to train the model
- the device it was trained on
- some specific machine learning inputs such as the initialization of layer weights
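
One way to make these inputs traceable is to write a small manifest alongside every training run. The sketch below is a minimal illustration using only the Python standard library; the function names (`file_sha256`, `current_git_commit`, `package_versions`, `build_manifest`) and the manifest keys are hypothetical, and it covers only part of the list above (driver versions, container image, and device would need to be recorded separately).

```python
import hashlib
import json
import platform
import subprocess
import sys
from importlib import metadata


def file_sha256(path, chunk_size=1 << 20):
    """Hash a dataset file so the exact training data can be identified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Return the current commit hash, or None when git or a checkout is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def package_versions(names):
    """Look up installed versions for the named packages (None if not installed)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(dataset_path, params, seed, packages=()):
    """Collect the traceable inputs of one training run into a JSON-friendly dict."""
    return {
        "dataset_sha256": file_sha256(dataset_path),
        "code_commit": current_git_commit(),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": package_versions(packages),
        "params": params,
        "seed": seed,  # assumed to drive weight initialization and shuffling
    }
```

The resulting dictionary can be serialized with `json.dumps(manifest, indent=2)` and stored next to the model artifact, so that a later rebuild can compare its own manifest against the original.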

### Collaboration

Collaboration begins with a unified hub where all activity, lineage, and model performance are tracked. This includes training runs, Jupyter notebooks, hyperparameter searches, visualizations, statistical metrics, datasets, code references, and a repository of model artifacts (often referred to as a model repository). Layering in granular permissions for team members, audit logs, and tags is also essential.

The hub should span each phase of the model lifecycle, from concept and R&D through testing and QA, all the way to production.

### Scalability

Machine learning in practice requires a large (and sometimes massive) amount of computational power and storage, and often requires esoteric infrastructure like purpose-built silicon (e.g. GPUs, TPUs, and the myriad of other chips entering the market). Machine learning engineers need an infrastructure abstraction layer that makes it easy to schedule and scale workloads without needing years of experience in networking, Kubernetes, Docker, storage systems, etc.

Engineers also need on-demand compute and storage resources so they can iterate faster in the training, tuning, and inference phases.

Some examples of infrastructure automation:

- Multi-cloud: A machine learning platform should make it trivial to train a model on-premise and seamlessly deploy that model to the public cloud (or vice versa).
- Scaling workloads: As computational demands increase, training or tuning models that span multiple compute instances becomes essential. Plumbing shared storage volumes into a distributed fleet of containers running on heterogeneous hardware and connected via an MPI or gRPC message bus is not something you want your machine learning engineers spending cycles on.

### Process Automation / Speed

The goal is the machine learning equivalent of a software application that is automatically compiled, tested, and deployed. Once an application is deployed, it is standard to rely on automated health and performance monitoring.

This means enabling automated testing of machine learning artifacts (e.g. data validation, ML model testing, and ML model integration testing).
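
As an illustration of the data validation piece, the sketch below checks a batch of records against a simple schema before it reaches training or scoring. The `validate_rows` function, the schema layout, and the 10% missingness threshold are all assumptions made for this example, not part of any particular tool.

```python
def validate_rows(rows, schema, max_missing_fraction=0.1):
    """Return a list of problems found in a batch; an empty list means it passed.

    schema maps column name -> (expected type, min value, max value);
    min or max may be None to skip that side of the range check.
    """
    if not rows:
        return ["batch is empty"]
    problems = []
    for column, (expected_type, low, high) in schema.items():
        values = [row.get(column) for row in rows]
        missing = sum(value is None for value in values)
        if missing / len(rows) > max_missing_fraction:
            problems.append(f"{column}: {missing}/{len(rows)} values missing")
        for value in values:
            if value is None:
                continue
            if not isinstance(value, expected_type):
                problems.append(f"{column}: unexpected type {type(value).__name__}")
                break  # one report per column is enough
            too_low = low is not None and value < low
            too_high = high is not None and value > high
            if too_low or too_high:
                problems.append(f"{column}: value {value!r} outside [{low}, {high}]")
                break
    return problems


# Hypothetical schema for illustration only.
SCHEMA = {"age": (int, 0, 120), "amount": (float, 0.0, None)}
```

In an automated pipeline, a non-empty result from a check like this would fail the run before any model is trained on the batch.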

### Reliability / Transparency

- Interpretability
  - Patterns / trends the model uses
  - Reasons for a specific prediction
- Ongoing model performance efficacy / model drift
- Identify / reduce / remove bias issues
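
One common way to get at the patterns a model relies on is permutation importance: shuffle one feature at a time and measure how much accuracy drops. This technique is not named in the notes above; it is offered here only as one possible sketch, with a hypothetical `permutation_importance` function that treats the model as a plain callable over list-valued rows.

```python
import random


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Return the accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)  # fixed seed so results are repeatable

    def accuracy(data):
        return sum(model(row) == label for row, label in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        permuted = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
        importances.append(baseline - accuracy(permuted))
    return importances
```

A feature whose shuffling leaves accuracy unchanged is one the model effectively ignores; a large drop suggests the model depends on it heavily.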

## Data

- Pipeline
  - Features
    - Maintain existing
    - Test / prototype feature engineering
    - Add new
    - Deprecate old
  - Ground Truth
  - Flexibility in source
    - Ex: MSSQL vs Postgres
    - Reduce rows (ex: reduce to not NA for sparse rows)
  - Ingestion format flexibility
    - Tabular
    - JSON
    - Graph
  - Scalability
- Data Versioning
  - Save training data
- EDA
  - Visualization
  - Metrics
    - Correlation with target
    - Missingness
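
The two EDA metrics listed above can be computed per feature with very little code. The sketch below assumes columns are plain Python lists with `None` marking missing values, and uses Pearson correlation as the correlation measure; both choices, and the function names, are assumptions for illustration.

```python
import math


def missing_fraction(values):
    """Fraction of entries in a column that are missing (None)."""
    return sum(value is None for value in values) / len(values)


def pearson_with_target(feature, target):
    """Pearson correlation between a feature and the target, skipping missing pairs.

    Returns None when fewer than two complete pairs exist or a column is constant.
    """
    pairs = [(x, y) for x, y in zip(feature, target) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return covariance / math.sqrt(var_x * var_y)
```

Running both functions over every candidate feature gives a quick table for deciding which features to prototype, keep, or deprecate.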

## Modeling

- Process
  - Pull in data
  - Preprocess features
  - Train models and evaluate
  - Hyperparameter tuning
  - Versioning
    - Compare current model with new/refreshed version
  
- Experimentation
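
The "compare current model with new/refreshed version" step can be sketched as a simple check on a shared holdout set. Here models are plain callables, accuracy is the metric, and the refreshed model is promoted only if it beats the current one by a minimum margin; the `choose_model` name, the metric, and the `min_gain` default are assumptions for illustration.

```python
def accuracy(model, features, labels):
    """Share of holdout examples the model labels correctly."""
    correct = sum(model(x) == y for x, y in zip(features, labels))
    return correct / len(labels)


def choose_model(current, candidate, features, labels, min_gain=0.01):
    """Score both models on the same holdout data and decide whether to promote."""
    current_score = accuracy(current, features, labels)
    candidate_score = accuracy(candidate, features, labels)
    return {
        "current_score": current_score,
        "candidate_score": candidate_score,
        "promote": candidate_score - current_score >= min_gain,
    }
```

Requiring a margin rather than any improvement at all helps avoid swapping models over differences that are within noise on the holdout set.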

## Deployment

- Choose model(s) to deploy
- Package chosen model(s)
- Prediction feature data
  - Only use selected features for individual model (model will need to inform deployment pipeline what raw features it uses)
- Output predictions
  - Flexibility in consumer of predictions/output (ex: accounting dept, client services, etc)
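
One way for a model to "inform the deployment pipeline what raw features it uses" is to package the feature list together with the model itself. The sketch below is a hypothetical `PackagedModel` wrapper, not any specific serving framework: it selects only the declared features from an incoming record and fails loudly when one is absent.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PackagedModel:
    """A model bundled with the raw features it expects, in order."""
    name: str
    version: str
    feature_names: Sequence[str]
    predict_fn: Callable[[list], float]

    def predict(self, record: dict) -> float:
        # Refuse to score when a declared feature is absent from the record.
        missing = [n for n in self.feature_names if n not in record]
        if missing:
            raise KeyError(f"missing features for {self.name}: {missing}")
        # Extra keys in the record are ignored; only declared features are used.
        return self.predict_fn([record[n] for n in self.feature_names])
```

With this shape, the pipeline can pass the full raw record to every deployed model and let each one pick out its own inputs, for example `model.predict({"a": 1, "b": 2, "unused": 99})`.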

## Monitoring

- Staleness/drift
  - Alerting
  - Kick off AutoRefresh
- Transparency
  - Trends/patterns the model finds/targets
  - Predictions
    - Value
    - Reason for prediction
  - Feature usefulness
- Prediction missingness
  - Claims without predictions
  - Prediction lag / Outages
  - Alerting
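
For the staleness/drift item, one widely used measure is the Population Stability Index (PSI), which compares the distribution of a feature (or of predictions) at training time against recent production data. The notes above do not prescribe a metric, so treat this as one possible sketch; the function names are hypothetical, and the 0.2 alert threshold is a common rule of thumb rather than a fixed standard.

```python
import math


def population_stability_index(expected, actual, bins=10, floor=1e-6):
    """PSI between a reference sample and a recent sample of one numeric variable."""
    low, high = min(expected), max(expected)
    if high == low:
        raise ValueError("reference sample has no spread; cannot build bins")
    width = (high - low) / bins

    def fractions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            # Clamp values outside the reference range into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor empty bins to avoid log(0) and division by zero.
        return [max(count / len(values), floor) for count in counts]

    expected_frac = fractions(expected)
    actual_frac = fractions(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_frac, actual_frac)
    )


def drift_alert(psi, threshold=0.2):
    """True when drift is large enough to alert or kick off a refresh."""
    return psi > threshold
```

A scheduled job could compute this per feature and, when `drift_alert` returns true, raise an alert or trigger the automatic refresh mentioned above.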

## Resources

- [Machine Learning Engineering for Production (MLOps) Specialization on Coursera](https://www.coursera.org/specializations/machine-learning-engineering-for-production-mlops#courses)
- [MLOps (Machine Learning Operations) Fundamentals](https://www.coursera.org/learn/mlops-fundamentals)
- [MLOps: How to choose the best ML model tools](https://www.ambiata.com/blog/2020-12-07-mlops-tools/)
- [What is MLOps?](https://blog.paperspace.com/what-is-mlops/)
- [CI/CD for Machine Learning & AI](https://blog.paperspace.com/ci-cd-for-machine-learning-ai/)
- [MLOps: Industrialized AI](https://www2.deloitte.com/us/en/insights/focus/tech-trends/2021/mlops-industrialized-ai.html)
- [MLOps Tooling Landscape v2](https://huyenchip.com/2020/12/30/mlops-v2.html)
- [MLOps Roadmap 2020](https://github.com/cdfoundation/sig-mlops/blob/master/roadmap/2020/MLOpsRoadmap2020.md)
- [ml-ops.org](https://ml-ops.org/#gettingstarted)
- [ml-ops.org GitHub page (awesome-mlops)](https://github.com/visenger/awesome-mlops)
- [MLOps examples from Microsoft Open Source](https://github.com/microsoft/MLOps)
- [Machine Learning Operations at run.ai](https://www.run.ai/guides/machine-learning-engineering/machine-learning-operations/)
- [What Is MLOps? Machine Learning Operations Explained from bmc blog](https://www.bmc.com/blogs/mlops-machine-learning-ops/)
- [The MLOps Toolkit](https://testdriven.io/blog/mlops/)
- [What is MLOps? Machine Learning Operations Explained from freeCodeCamp](https://www.freecodecamp.org/news/what-is-mlops-machine-learning-operations-explained/)
- [MLOps Explained by Arrikto](https://www.arrikto.com/mlops-explained/)
- [Google Cloud Architecture Center: MLOps: Continuous delivery and automation pipelines in machine learning](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
- [MLOps tag on the neptune blog](https://neptune.ai/blog/category/mlops)
- [AWS MLOps Framework](https://aws.amazon.com/solutions/implementations/aws-mlops-framework/)