# Software Engineering for Machine Learning: Characterizing and Detecting Mismatch in Machine-Learning Systems
Grace Lewis and Ipek Ozkaya, May 17, 2021, https://insights.sei.cmu.edu/blog/software-engineering-for-machine-learning-characterizing-and-detecting-mismatch-in-machine-learning-systems/

### Introduction
Interest is growing in incorporating artificial intelligence (AI) and machine-learning (ML) components into software systems. This interest is fueled by the availability of tools for developing ML components and by their potential to improve solutions for data-driven decision problems. However, integrating ML components into production systems presents challenges. Developing an ML system involves more than building a model: it also requires testing for production readiness, integrating the model into larger systems, monitoring it in real time, and adapting it to changing data. In response to these complexities, a new field called software engineering for machine learning (SE4ML) is emerging.

An ML-enabled system is a software system that uses ML components for its capabilities. To create effective ML-enabled systems:

1. Integration of ML components should be easy.
    
2. The system must be set up for runtime monitoring of ML components and production data.
    
3. The training and retraining cycle of these systems should be fast.

While some software engineering practices apply directly to these requirements, they are not commonly used in the data science field, which focuses on developing ML algorithms and models for integration. Software engineering practices must therefore be adapted or extended to deal effectively with ML components.

### ML Mismatch

In a blog post we published last June, "Detecting Mismatches in Machine-Learning Systems," we observed that the ability to integrate ML components into applications is limited by, among other factors, mismatches between different system components. One reason is that the development and deployment of ML-enabled systems involve three distinct disciplines: data science, software engineering, and operations. The distinct perspectives of these disciplines, when misaligned, cause ML mismatches that can result in failed systems. For example, if an ML model is trained on data that is different from data in the production environment, the performance of the ML component in the field will be reduced dramatically.

What makes ML components different from the traditional components in software systems is that they are highly data dependent. Their performance in production thus depends on how similar the production data is to the data that was used to train the ML model. A difference between the two is often called training-serving skew. To succeed, ML-enabled systems must provide a way to know when model performance is degrading, and they must provide enough information to effectively retrain the models when it does. The more comprehensive and detailed the information gathered, the quicker a model can be developed, retrained, and redeployed.
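As a rough illustration of how a system might notice that production data no longer resembles training data, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for a single numeric feature. This is one common drift measure, not a technique named in this post. The function names and the 0.2 threshold are assumptions chosen for illustration, and a real system would tune the threshold for each feature.

```python
from bisect import bisect_right


def ks_statistic(train, prod):
    """Return the largest gap between the empirical CDFs of two samples."""
    a, b = sorted(train), sorted(prod)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at sample
    # values, so checking every observed value is enough to find the maximum.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def skew_detected(train, prod, threshold=0.2):
    """Flag a feature whose production values drift from training values.

    The default threshold is an arbitrary, illustrative assumption.
    """
    return ks_statistic(train, prod) > threshold


if __name__ == "__main__":
    training_values = [1, 2, 3, 4]
    production_values = [3, 4, 5, 6]
    print(ks_statistic(training_values, production_values))
    print(skew_detected(training_values, production_values))
```

For the two samples in the usage example, the statistic is 0.5, which exceeds the assumed threshold and would flag the feature for review.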

Here are some examples of mismatch in ML-enabled systems:

- computing-resource mismatch: poor system performance because the computing resources that are required to execute the model are not available in the production environment
- data-distribution mismatch: poor model accuracy because the training data doesn’t match the production data
- API mismatch: the need to generate a lot of glue code because the ML component is expecting different inputs and outputs than what is provided by the system in which it is integrated
- test-data mismatch: inability of software engineers to properly test a component because they don’t have access to test data or don’t fully understand the component or know how to test it
- monitoring mismatch: inability of the monitoring tools in the production environment to collect ML-relevant metrics, such as model accuracy

As part of our work on SE4ML, we developed a set of machine-readable descriptors for elements of ML-enabled systems that externalize and codify the assumptions made by different system stakeholders. The goal for the descriptors is to support automated detection of mismatches at both design time and run time.
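To make the idea of machine-readable descriptors more concrete, here is a minimal sketch of how two descriptors might be compared at design time. The post does not specify the descriptor schema, so every class name, field name, and check below is a hypothetical stand-in rather than the actual SE4ML descriptor format.

```python
from dataclasses import dataclass, field


@dataclass
class TrainedModelDescriptor:
    # Hypothetical fields recording what the data science team assumes.
    required_memory_mb: int
    input_schema: dict  # feature name -> type name
    output_type: str


@dataclass
class ProductionEnvironmentDescriptor:
    # Hypothetical fields recording what engineering and operations provide.
    available_memory_mb: int
    provided_inputs: dict  # feature name -> type name
    expected_output_type: str
    collected_metrics: set = field(default_factory=set)


def detect_mismatches(model, env, required_metrics=("accuracy",)):
    """Compare two descriptors and return a list of human-readable mismatches."""
    mismatches = []
    if model.required_memory_mb > env.available_memory_mb:
        mismatches.append(
            f"computing-resource mismatch: model needs {model.required_memory_mb} MB, "
            f"environment offers {env.available_memory_mb} MB"
        )
    for name, type_name in model.input_schema.items():
        provided = env.provided_inputs.get(name)
        if provided != type_name:
            mismatches.append(
                f"API mismatch: input '{name}' expects {type_name}, system provides {provided}"
            )
    if model.output_type != env.expected_output_type:
        mismatches.append(
            f"API mismatch: model outputs {model.output_type}, "
            f"system expects {env.expected_output_type}"
        )
    for metric in required_metrics:
        if metric not in env.collected_metrics:
            mismatches.append(f"monitoring mismatch: '{metric}' is not collected")
    return mismatches


if __name__ == "__main__":
    model = TrainedModelDescriptor(4096, {"age": "float", "income": "float"}, "float")
    env = ProductionEnvironmentDescriptor(2048, {"age": "int"}, "float", {"latency"})
    for problem in detect_mismatches(model, env):
        print(problem)
```

Keeping each stakeholder's assumptions in a separate descriptor means a check like this could run automatically whenever either side changes, instead of relying on the teams to spot the conflict during integration.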

### Characterizing and Detecting Mismatch in ML-Enabled Systems


We followed with a practitioner survey that assessed the importance of sharing information in each of these categories to avoid mismatches.


To validate these findings, we conducted a survey where the main question was, "How important is it for you as a practitioner to be aware of this information in order to avoid mismatch?" Although we had only 31 responses to our survey, fewer than we wanted, the survey clearly affirmed that the information that we had gathered was representative of information that these respondents thought should be shared. Different information was deemed more or less important depending on the role of the respondent (data scientist, software engineer, or operations staff member), but taken as a whole, the results of the survey affirm the importance of communication to avoid mismatches.