# Chapter 1. Introduction

## Building Machine Learning Pipelines

Benefits of Machine Learning Pipelines:

1) Frees you to focus on new models instead of maintaining existing ones

2) Prevents bugs by keeping the same preprocessing steps through different iterations

3) Creates a paper trail of hyperparameters and datasets used as well as model metrics

4) Standardization across models results in efficiency across teams

Life Cycle of a Machine Learning Model:

1) Data ingestion/versioning

*   Process the data into a format the following components can digest, and version the incoming data

2) Data validation
*   Check that distribution and categories are as expected
*   Check whether classes are balanced or imbalanced
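A minimal sketch of the two validation checks above, using only the standard library. The function names, the `min_share` threshold, and the toy data are assumptions made for illustration; they are not taken from the book's example project.

```python
from collections import Counter


def validate_categories(values, expected):
    """Return the set of category values that were not expected."""
    return set(values) - set(expected)


def class_balance(labels):
    """Return each class's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_imbalanced(labels, min_share=0.2):
    """Flag the dataset when any class falls below min_share (a hypothetical threshold)."""
    return any(share < min_share for share in class_balance(labels).values())


# Toy data, invented for illustration only.
products = ["Mortgage", "Credit card", "Mortgage", "Payday loan"]
disputed = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

unexpected = validate_categories(products, expected={"Mortgage", "Credit card"})
shares = class_balance(disputed)

print(unexpected)               # categories the pipeline has not seen before
print(shares)                   # fraction of rows per class
print(is_imbalanced(disputed))  # True when a class is rarer than min_share
```

In a real pipeline these checks would run before preprocessing, so that an unexpected category or a shifted class balance stops the run early.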

3) Data preprocessing
*   Ex: one-hot encoding, tokenizing text
*   Modifying the preprocessing steps invalidates previously processed data and requires updating the entire pipeline
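The two preprocessing examples above can be sketched in plain Python. The helper names and the sample values are hypothetical. Note how the column index of each category depends on the vocabulary: if the vocabulary changes, previously encoded data no longer lines up, which is one way a preprocessing change can invalidate earlier data.

```python
import re


def build_vocabulary(values):
    """Map each distinct category to a column index (sorted for determinism)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Encode one category as a 0/1 vector; unknown values become all zeros."""
    vector = [0] * len(vocabulary)
    if value in vocabulary:
        vector[vocabulary[value]] = 1
    return vector


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Toy inputs, invented for illustration only.
states = ["CA", "NY", "TX", "CA"]
vocabulary = build_vocabulary(states)
encoded = [one_hot(state, vocabulary) for state in states]
tokens = tokenize("The bank charged me twice!")

print(vocabulary)
print(encoded)
print(tokens)
```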

4) Model training and tuning

*   Train a model to take inputs and predict output with the lowest error possible

5) Model analysis

*   Go beyond accuracy or loss when choosing the optimal model; also use metrics such as precision, recall, and AUC
*   Evaluate model's dependency on features used in training
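To illustrate why accuracy alone can mislead, here is a small sketch that computes accuracy, precision, recall, and AUC by hand. The labels and scores are made up, and the AUC uses the pairwise definition (the fraction of positive/negative pairs that the model ranks correctly, with ties counted as half), which is fine for toy data but slow for large datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data, invented for illustration: nine negatives, one positive.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 10  # a "model" that never predicts the positive class
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.1, 0.5]

print(accuracy(y_true, always_negative))          # high despite a useless model
print(precision_recall(y_true, always_negative))  # both zero
print(roc_auc(y_true, scores))
```

The always-negative predictor scores well on accuracy because the classes are imbalanced, while precision and recall expose that it never finds a positive case.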

6) Model versioning

*   Document all inputs into a new model version such as hyperparameters, datasets, and architecture
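One possible way to document a model version is a small JSON record that stores the hyperparameters, a hash of the training data, and a description of the architecture. This is a sketch under assumptions: the field names, the version string, and the sample rows are hypothetical, and a real pipeline would more likely rely on a metadata store than on hand-written records.

```python
import hashlib
import json


def fingerprint(rows):
    """Hash the dataset contents so the exact training data can be identified later."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def build_version_record(version, hyperparameters, rows, architecture):
    """Collect the inputs of a model version in one serializable dictionary."""
    return {
        "version": version,
        "hyperparameters": hyperparameters,
        "dataset_sha256": fingerprint(rows),
        "architecture": architecture,
    }


# Toy rows and settings, invented for illustration only.
rows = [
    {"product": "Mortgage", "disputed": 0},
    {"product": "Credit card", "disputed": 1},
]
record = build_version_record(
    version="1.0.0",
    hyperparameters={"learning_rate": 0.001, "batch_size": 32},
    rows=rows,
    architecture="placeholder description of the network",
)
serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```

Because the dataset hash changes whenever any row changes, two records with the same hash and hyperparameters describe the same training inputs.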

7) Model deployment

*   Use API interfaces like representational state transfer (REST) or remote procedure call (RPC) protocols
*   Host multiple models at the same time and perform A/B tests
*   Model servers can allow you to update a model version without redeploying your application
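The A/B testing idea above can be sketched as a deterministic traffic split between two hosted model versions. This is an illustration only: `assign_variant`, `MODELS`, and `predict` are hypothetical names, the two lambdas are stand-ins for real models, and production model servers usually provide their own routing configuration.

```python
import hashlib


def assign_variant(user_id, share_b=0.5):
    """Deterministically assign a user to model version 'A' or 'B'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000  # integer in 0..9999
    return "B" if bucket < share_b * 10_000 else "A"


# Hypothetical stand-ins for two hosted model versions.
MODELS = {
    "A": lambda features: 0.0,
    "B": lambda features: 1.0,
}


def predict(user_id, features, share_b=0.5):
    """Route the request to the assigned version and report which one answered."""
    variant = assign_variant(user_id, share_b)
    return variant, MODELS[variant](features)


print(predict("user-1", {"product": "Mortgage"}))
```

Hashing the user ID, rather than choosing at random per request, keeps each user on the same model version for the duration of the test.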

8) Model feedback

*   Capture information about the model's performance and, possibly, new data
*   This step may involve a human rather than being automated


Tools used to orchestrate the machine learning pipeline:

*   Apache Beam
*   Apache Airflow
*   Kubeflow Pipelines for Kubernetes infrastructure
*   TensorFlow ML MetadataStore (an artifact and metadata store rather than an orchestrator)


"Because of the two conditions (being directed and acyclic), pipeline graphs are called directed acyclic graphs (DAGs). You will discover DAGs are a central concept behind most workflow tools."
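To make the DAG idea concrete, the sketch below models pipeline steps as a dependency graph and derives a valid execution order with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The step names mirror the life cycle listed earlier; the graph itself and the helper functions are assumptions for illustration, not how any particular orchestrator represents its pipelines.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(graph):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """Return True when the graph is not acyclic and so cannot be scheduled."""
    try:
        execution_order(graph)
    except CycleError:
        return True
    return False


# Each step maps to the set of steps it depends on.
pipeline = {
    "data_validation": {"data_ingestion"},
    "data_preprocessing": {"data_validation"},
    "model_training": {"data_preprocessing"},
    "model_analysis": {"model_training"},
    "model_deployment": {"model_analysis"},
}

order = execution_order(pipeline)
print(order)

# Making ingestion depend on deployment closes a loop, so no valid order exists.
cyclic = dict(pipeline, data_ingestion={"model_deployment"})
print(has_cycle(cyclic))
```

The directed edges say which step must finish before another starts, and the acyclic condition guarantees that an execution order exists at all.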

Example Project:
"To follow along with this book, we have created an example project using open source data. The dataset is a collection of consumer complaints about financial products in the United States, and it contains a mixture of structured data (categorical/numeric data) and unstructured data (text). The data is taken from the Consumer Finance Protection Bureau."

"The core of our example deep learning project is the model generated by the function get_model in the components/module.py script of our example project. The model predicts whether a consumer disputed a complaint using the following features:

*   The financial product
*   The subproduct
*   The company’s response to the complaint
*   The issue that the consumer complained about
*   The US state
*   The zip code
*   The text of the complaint (the narrative)

For the purpose of building the machine learning pipeline, we assume that the model architecture design is done and we won’t modify the model. We discuss the model architecture in more detail in Chapter 6. But for this book, the model architecture is a very minor point. This book is all about what you can do with your model once you have it."