
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Production Issues

Deploying machine learning models is complex.  While some DevOps best practices from traditional software development apply, there are a number of additional concerns.  This lesson explores the production issues that arise when deploying and monitoring machine learning models.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Introduce the primary concerns of production environments: reliability, scalability, and maintainability
 - Explore how model deployment differs from conventional software development deployment
 - Compare and contrast deployment architectures
 - Explore Continuous Integration and Continuous Deployment (CI/CD) for machine learning

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) From Data Science to Data Engineering

Data Science != Data Engineering<br><br>

* Data science and data engineering are two related but distinct practices
* Data scientists generally concern themselves with deriving business insights from data.  
  - They look to turn business problems into data problems, model those data problems, and optimize model performance.  
* Data engineers are generally concerned with a host of different production issues:  
  - **Reliability:** The system should work correctly, even when faced with hardware and software failures or human error
  - **Scalability:** The system should be able to deal with growing and shrinking volume of data
  - **Maintainability:** The system should be able to work with current demands and future feature updates

<a href="https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/" target="_blank">Source</a>

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/ASUMDM.png" style="height: 450px; margin: 20px"/></div>

We need the left side of this diagram to be "closed loop" (that is, fully automated).

<a href="ftp://ftp.software.ibm.com/software/data/sw-library/services/ASUM.pdf" target="_blank">Source</a>

#### Reliability: Working Correctly<br><br>

 - **Fault tolerance** 
   - Hardware failures
   - Software failures
   - Human errors
 - **Robustness** 
   - ad hoc queries
 - **Security**

#### Scalability: Growing<br><br>

 - **Throughput** 
   - Choose the best *load parameters* (e.g. GET or POST requests per second)
   - Question: what's the difference between 100k requests/sec of 1 KB each vs 3 requests/min of 2 GB each?
 - **Demand vs resources**
   - Vertical vs horizontal scalability
   - Linear scalability
   - Big O notation, % of task that's parallelizable 
 - **Latency and response time**
   - What is the speed at which applications respond to new requests?
   - Mean vs percentiles
   - Service Level Agreements (SLAs)
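
The bullets above can be made concrete with a little arithmetic. The following is a minimal sketch, not a benchmark: the function names and the sample latencies are made up for illustration, decimal units (1 KB = 1,000 bytes) are assumed, and Amdahl's law is used as one common way to reason about the parallelizable share of a task.

```python
import math
import statistics

KB, GB = 10**3, 10**9  # decimal units, an assumption for this example


def bytes_per_second(requests, per_seconds, bytes_per_request):
    """Aggregate data volume per second for a request workload."""
    return requests * bytes_per_request / per_seconds


# The two workloads from the throughput question above.
small_and_frequent = bytes_per_second(100_000, 1, 1 * KB)  # 100k requests/sec of 1 KB
large_and_rare = bytes_per_second(3, 60, 2 * GB)           # 3 requests/min of 2 GB


def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: best-case speedup when only part of a task parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers)


# Illustrative response times: 99 fast requests and one very slow one.
latencies_ms = [10] * 99 + [1000]
mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p99_ms = statistics.quantiles(latencies_ms, n=100)[98]  # 99th percentile

print(f"small and frequent: {small_and_frequent:,.0f} bytes/sec")
print(f"large and rare:     {large_and_rare:,.0f} bytes/sec")
print(f"speedup at 90% parallel, 10 workers: {amdahl_speedup(0.9, 10):.2f}x")
print(f"mean={mean_ms} ms, median={median_ms} ms, p99={p99_ms} ms")
```

Both workloads move the same number of bytes per second, yet they stress a system very differently (many small requests vs a few very large ones), which is why the choice of load parameter matters. Likewise, the mean latency sits well above what a typical request sees while hiding how bad the slowest request is, which is one reason SLAs are often stated in percentiles.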

#### Maintainability: Running and Extending

The majority of the cost of software development is in ongoing maintenance<br><br>

 - Operability (ease of operation)
   - Maintenance
   - Monitoring
   - Upgrading
   - Debuggability
 - Generalization and extensibility
   - The reuse of existing codebase assets (e.g. code, test suites, architectures) within a software application
   - The ability to extend a system to meet new feature demands, and the level of effort required to implement an extension
   - Quantified by the ability to make changes while minimizing the impact on existing system functions
 - Automation
   - The extent to which an application is a "closed loop" that requires no human oversight or intervention
   
Helpful design pattern: decoupled storage and compute

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) DevOps vs "ModelOps"

DevOps = software development + IT operations<br><br>

* Manages the development life cycle
* Delivers features, patches, and updates
* Quality assurance
* Reduces friction between development, quality assurance, and production
* Strives to be agile (not waterfall)

"ModelOps" = data modeling + deployment operations

The problem...<br><br>

* Data scientists use a zoo of different frameworks and languages
* Scripting languages like Python and R are problematic in production
* Data scientists _love_ libraries.  Many are not production ready
* A chasm between academic and production solutions (scalability, reliability)
* Slow to update models
* Different code paths

Production means...<br><br>

* Deployment mostly in the Java ecosystem
  - Including Scala, which runs on the JVM
  - Also C/C++ and legacy environments 
* Using containers

Goals of "ModelOps"...<br><br>

* Brings machine learning into production
  - Model serialization
  - Refactor data science solutions
  - Containerizing 
* Adds model performance to testing and monitoring
* Reduces time between model development and deployment
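
As a minimal sketch of the model serialization goal above, the example below saves a model's learned parameters as JSON and restores them elsewhere. `ThresholdModel` and its methods are hypothetical stand-ins invented for this example, not part of any library, and JSON is chosen only because it is language-neutral; real pipelines typically rely on a dedicated serialization format.

```python
import json


class ThresholdModel:
    """Hypothetical model: predicts 1 for values above a learned threshold."""

    def __init__(self, threshold=None):
        self.threshold = threshold

    def fit(self, values):
        # "Training" here is just learning the mean of the data.
        self.threshold = sum(values) / len(values)
        return self

    def predict(self, values):
        return [int(v > self.threshold) for v in values]

    def to_json(self):
        """Serialize the learned parameters, not the Python object itself."""
        return json.dumps({"model_type": "ThresholdModel", "threshold": self.threshold})

    @classmethod
    def from_json(cls, payload):
        params = json.loads(payload)
        if params["model_type"] != "ThresholdModel":
            raise ValueError(f"unexpected model type: {params['model_type']}")
        return cls(threshold=params["threshold"])


# Data science side: train and serialize.
trained = ThresholdModel().fit([1.0, 2.0, 3.0, 4.0])
payload = trained.to_json()

# Production side: restore and predict without retraining.
restored = ThresholdModel.from_json(payload)
print(restored.predict([0.0, 5.0]))
```

Because only parameters are stored, the payload could in principle be read by a scoring service written in another language, which matters when production runs on the JVM.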

Other considerations...<br><br>

* Training vs prediction time
  - Some models have high training but low prediction time, and vice versa
* I/O vs CPU bound problems
* Live training algorithms

Other Technologies and Frameworks...<br><br>

* <a href="https://www.kubeflow.org/#overview" target="_blank">Kubeflow</a>: A data-engineering-focused way to manage the ML process
* <a href="https://code.fb.com/core-data/introducing-fblearner-flow-facebook-s-ai-backbone/" target="_blank">FBLearner</a>: Another way to manage the ML process
* <a href="https://github.com/databricks/spark-sklearn" target="_blank">Spark-sklearn</a>: Model tuning at scale with Spark
* <a href="http://mleap-docs.combust.ml/" target="_blank">MLeap</a>: Serialization format and execution engine for machine learning pipelines (`Spark`, `sklearn`, `TensorFlow`)
  - <a href="https://docs.databricks.com/spark/latest/mllib/mleap-model-export.html" target="_blank">See the Databricks docs for a runthrough</a>
* <a href="https://onnx.ai/supported-tools" target="_blank">ONNX</a>: A community project created by Facebook and Microsoft for greater interoperability in the AI tools community

Managing Deployments...<br><br>

 - SageMaker
 - Azure ML

Model registry requirements...<br><br>

- Stores models as first class citizens
  - trained
  - deployed
- Metadata 
  - origin story
  - metrics
  - monitoring
  - telemetry
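
The requirements above could be sketched as a small in-memory registry. This is an illustration only: `ModelVersion`, `ModelRegistry`, and every field name are hypothetical, and a real registry would persist its records and also collect monitoring and telemetry data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ModelVersion:
    """One registered model, treated as a first-class record."""
    name: str
    version: int
    stage: str     # "trained" or "deployed"
    origin: dict   # origin story: training data, code revision, author
    metrics: dict  # evaluation metrics recorded at training time
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class ModelRegistry:
    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion

    def register(self, name, origin, metrics):
        """Store a newly trained model and assign the next version number."""
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(name, len(versions) + 1, "trained", origin, metrics)
        versions.append(entry)
        return entry

    def deploy(self, name, version):
        """Mark one version as deployed and demote any previous deployment."""
        for entry in self._versions[name]:
            if entry.stage == "deployed":
                entry.stage = "trained"
        target = self._versions[name][version - 1]
        target.stage = "deployed"
        return target

    def deployed(self, name):
        """Return the currently deployed version, or None."""
        return next(
            (e for e in self._versions.get(name, []) if e.stage == "deployed"),
            None,
        )


registry = ModelRegistry()
registry.register("churn", {"data": "2019-01 snapshot", "commit": "abc123"}, {"auc": 0.81})
registry.register("churn", {"data": "2019-02 snapshot", "commit": "def456"}, {"auc": 0.84})
registry.deploy("churn", 2)
print(registry.deployed("churn"))
```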

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Three Architectures

Batch processing (Table Query)<br><br>

  - Example: churn prediction where high predicted churn generates targeted marketing
  - Most deployments are batch (approx 80%)
  - Save predictions to database
  - Query database in order to serve the predictions
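
The batch pattern above (score everything, save to a database, serve by query) might look like the following sketch. It uses an in-memory SQLite database from the Python standard library; the `churn_predictions` table, the customer data, and the `predict_churn` rule are all made up for illustration and stand in for a real trained model and production database.

```python
import sqlite3


def predict_churn(days_since_last_login):
    """Hypothetical scoring rule standing in for a trained churn model."""
    return min(days_since_last_login / 90.0, 1.0)


# (customer_id, days since last login): illustrative data only.
customers = [("c1", 3), ("c2", 60), ("c3", 120)]

# Batch step: score every customer and save the predictions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE churn_predictions (customer_id TEXT PRIMARY KEY, churn_score REAL)"
)
conn.executemany(
    "INSERT INTO churn_predictions VALUES (?, ?)",
    [(customer_id, predict_churn(days)) for customer_id, days in customers],
)
conn.commit()


def high_risk_customers(connection, threshold=0.5):
    """Serving step: a plain table query, with no model call at request time."""
    rows = connection.execute(
        "SELECT customer_id FROM churn_predictions "
        "WHERE churn_score >= ? ORDER BY customer_id",
        (threshold,),
    )
    return [row[0] for row in rows]


print(high_risk_customers(conn))
```

The marketing system in the churn example only needs read access to the table; the model itself never has to be online.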

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-4/deployment-options.png" style="height: 400px; margin: 20px"/></div>

Continuous/Real-Time (Streaming)<br><br>

  - Typically asynchronous
  - Example: Live predictions on a stream of data (e.g. web activity logs)
  - Predict using a pre-trained model
  - Latency in seconds
  - Often using Spark
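
To illustrate the streaming pattern above, here is a minimal sketch in plain Python that applies a pre-trained model to events as they arrive. The generator stands in for a real stream, and `is_server_error`, `score_stream`, and the log fields are hypothetical names chosen for this example; a production system would more likely use a streaming engine such as Spark.

```python
def is_server_error(event):
    """Hypothetical pre-trained model: flags HTTP 5xx responses."""
    return event["status"] >= 500


def score_stream(events, model):
    """Attach a prediction to each event as it arrives, one at a time."""
    for event in events:
        yield {**event, "prediction": model(event)}


def web_log_events():
    """Stand-in for an unbounded stream of web activity logs."""
    yield {"path": "/home", "status": 200}
    yield {"path": "/checkout", "status": 503}
    yield {"path": "/cart", "status": 404}


scored = list(score_stream(web_log_events(), is_server_error))
for event in scored:
    print(event)
```

Nothing waits on the result of a single prediction here: events are scored as they flow through, which is what makes the pattern asynchronous.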

On demand (REST, Embedded)<br><br>

  - Typically synchronous
  - Millisecond latency
  - Normally served by REST or RMI
  - Example: live fraud detection on credit card transaction requests
  - Serialized or containerized model saved to a blob store or HDFS
  - Served using another engine (e.g. SageMaker or Azure ML) 
  - Could involve live model training
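
A synchronous, on-demand endpoint could be sketched with nothing but the Python standard library, as below. This is an illustration under stated assumptions: `score_transaction` is a made-up rule standing in for a deserialized model, `PredictHandler` and `make_server` are hypothetical names, and a real deployment would sit behind a managed serving engine rather than `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score_transaction(transaction):
    """Hypothetical fraud rule standing in for a deserialized model."""
    return {"fraud": transaction["amount"] > 1000}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, score it, and reply in the same request.
        length = int(self.headers.get("Content-Length", 0))
        transaction = json.loads(self.rfile.read(length))
        body = json.dumps(score_transaction(transaction)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=0):
    """Build a server on localhost; port 0 asks the OS for any free port."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Calling `make_server(8080).serve_forever()` would start the service, after which a client can POST a body such as `{"amount": 5000}` and receive the prediction in the response. The caller blocks until the answer arrives, which is why latency budgets for this architecture are so tight.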

Phases of Deployment<br><br>

- dev/test/staging/production
- Model versioning
- When you should retrain, what you should retrain on (e.g. a trailing window, all data)
- Warm starts
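
The choice of what to retrain on can be expressed as a data selection step. The sketch below contrasts a trailing window with all available data; the record layout, the dates, and the function names are assumptions made for this example.

```python
from datetime import datetime, timedelta


def trailing_window(records, as_of, days):
    """Keep only records from the last `days` days up to `as_of`."""
    cutoff = as_of - timedelta(days=days)
    return [r for r in records if cutoff < r["timestamp"] <= as_of]


def all_data(records, as_of):
    """Keep every record available at retraining time."""
    return [r for r in records if r["timestamp"] <= as_of]


# Illustrative labeled records only.
records = [
    {"timestamp": datetime(2019, 1, 1), "label": 0},
    {"timestamp": datetime(2019, 3, 10), "label": 1},
    {"timestamp": datetime(2019, 3, 20), "label": 1},
]
as_of = datetime(2019, 3, 31)

recent = trailing_window(records, as_of, days=30)
everything = all_data(records, as_of)
print(len(recent), len(everything))
```

A trailing window tracks recent behavior at the cost of discarding history, while training on all data does the opposite; which is appropriate depends on how quickly the underlying data changes.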

A/B Testing<br><br>

- Technologies include Clipper, Optimizely, and Split.io
- Can also be done with batch inference in a custom way
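
One custom way to split traffic between two model versions is deterministic hashing, sketched below. The function name, the experiment label, and the 50/50 split are assumptions for illustration; this is not how any of the tools listed above work internally.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.5, experiment="model-v2-rollout"):
    """Deterministically assign a user to model "A" (control) or "B" (treatment)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    return "B" if bucket < treatment_share else "A"


# The same user always lands in the same group, so no assignment table is needed.
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"share of users on model B: {share_b:.3f}")
```

Because assignment depends only on the user ID and the experiment name, a batch job and an online service would agree on each user's group without sharing state.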

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Continuous Integration and Continuous Deployment

Continuous...<br><br>

* Integration: developers push changes often to build and run automated tests
* Delivery: automated, easy ways of deploying an application
* Deployment: one step further than continuous delivery; every change that passes the automated tests is released with no human intervention
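
For machine learning, the automated tests in such a pipeline can include model performance. The following is a minimal sketch of a quality gate that a CI job might run before promoting a candidate model; the thresholds, the labels, and the function names are hypothetical choices for this example.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def passes_quality_gate(candidate_acc, production_acc,
                        min_accuracy=0.80, max_regression=0.02):
    """Allow deployment only if the candidate is good enough in absolute
    terms and not meaningfully worse than the model already in production."""
    return (candidate_acc >= min_accuracy
            and candidate_acc >= production_acc - max_regression)


def test_candidate_model_is_deployable():
    """A test a CI runner could execute on a held-out evaluation set."""
    labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    production_predictions = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
    candidate_predictions = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]
    assert passes_quality_gate(
        accuracy(candidate_predictions, labels),
        accuracy(production_predictions, labels),
    )


test_candidate_model_is_deployable()
```

With continuous delivery, a passing gate makes the model ready to release; with continuous deployment, a passing gate is the release decision.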

Solutions...<br><br>

 * Bamboo (Atlassian)
 * TeamCity
 * Jenkins
 * GitLab
 * Travis CI
 * Airflow
 * Azure CI/CD pipelines
 * Amazon CodePipeline
 
<a href="https://databricks.com/blog/2017/10/30/continuous-integration-continuous-delivery-databricks.html" target="_blank">See blog for incorporation of Databricks in a CI/CD Pipeline</a>

## Next Steps

Start the next lesson, [Batch Deployment]($./06-Batch-Deployment).

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>