# Software Engineering for Machine Learning: Characterizing and Detecting Mismatch in Machine-Learning Systems
Grace Lewis and Ipek Ozkaya, May 17, 2021, <a href="https://insights.sei.cmu.edu/blog/software-engineering-for-machine-learning-characterizing-and-detecting-mismatch-in-machine-learning-systems/" target="_blank">link</a>
### Introduction
The passage discusses the growing interest in incorporating artificial intelligence (AI) and machine-learning (ML) components into software systems. This interest is fueled by the availability of tools for developing ML components and their potential to improve solutions for data-driven decision problems. However, integrating ML components into production systems presents challenges. Developing an ML system involves more than just building a model; it requires testing for production readiness, integration into larger systems, real-time monitoring, and adaptation to changing data. As a result of these complexities, a new field called software engineering for machine learning (SE4ML) is emerging.

An ML-enabled system is a software system that uses ML components for its capabilities. To create effective ML-enabled systems:

1. Integration of ML components should be easy.
    
2. The system must be set up for runtime monitoring of ML components and production data.
    
3. The training and retraining cycle of these systems should be fast.

While some software engineering practices directly apply to these requirements, they are often not commonly used in the data science field, which focuses on developing ML algorithms/models for integration. Adapting or extending software engineering practices is necessary to effectively deal with ML components.

### ML Mismatch

Integrating ML components into applications is hindered by various factors, including discrepancies between different system components. These discrepancies arise because three distinct disciplines are involved in the development and deployment of ML-enabled systems: data science, software engineering, and operations. When these disciplines' perspectives don't align, the result is ML mismatch, which can lead to system failures.

ML components differ from traditional software components because they rely heavily on data. Their performance in production depends on how closely the production data resembles the data used to train the ML model; a gap between the two is known as "training-serving skew." For ML-enabled systems to succeed, they need to offer a way to detect declining model performance and provide enough information to retrain the model effectively when this occurs.
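As a rough illustration of what detecting such skew could look like, the sketch below compares one feature's production values against its training values using a standardized mean shift. The function names, the threshold of 0.5, and the sample values are assumptions made for this example; they do not come from the original post, and a real system would likely use more robust statistics.

```python
from statistics import mean, stdev


def mean_shift_score(training_values, production_values):
    """Return how many training standard deviations the production mean
    has moved away from the training mean."""
    train_mean = mean(training_values)
    train_std = stdev(training_values)
    if train_std == 0:
        raise ValueError("training feature has zero variance")
    return abs(mean(production_values) - train_mean) / train_std


def has_skew(training_values, production_values, threshold=0.5):
    """Flag a feature whose shift score exceeds the (assumed) threshold."""
    return mean_shift_score(training_values, production_values) > threshold


# Hypothetical feature values: an age feature at training time vs. in production.
training_ages = [34, 45, 52, 47, 39, 41, 50, 44]
similar_ages = [36, 44, 51, 46, 40, 43, 49, 45]
shifted_ages = [68, 72, 75, 70, 66, 74, 71, 69]

print(has_skew(training_ages, similar_ages))  # production resembles training
print(has_skew(training_ages, shifted_ages))  # production has drifted
```

A check like this only signals that retraining may be needed; deciding how to retrain still requires the information about the training data that the post argues should be recorded.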

Here are some examples of mismatch in ML-enabled systems:

1. **Insufficient Computing Resources**: Poor system performance arises when production environments lack the necessary computing resources to execute ML models effectively.

2. **Data Compatibility Issues**: Model accuracy suffers due to disparities between training and production data, impacting the model's real-world performance.

3. **API Alignment Problems**: Integrating ML components requires extensive glue code when the expected inputs and outputs differ from those provided by the system.

4. **Testing Complexities**: Limited access to appropriate test data, incomplete understanding of components, and inadequate testing approaches hinder software engineers from thoroughly testing ML integration.

5. **Monitoring Hurdles**: Production environment monitoring tools often fail to capture crucial ML metrics, such as model accuracy, complicating performance assessment.

As part of their work on SE4ML, the authors developed a set of machine-readable descriptors for elements of ML-enabled systems that externalize and codify the assumptions made by different system stakeholders. The goal for the descriptors is to support automated detection of mismatches at both design time and run time.

### Study to Characterize and Codify ML Mismatches
To address the challenge of ML mismatches, the team embarked on a project focused on creating machine-readable descriptors for components within ML-enabled systems. These descriptors capture and formalize the various assumptions made by different stakeholders involved in the system. The aim of these descriptors is to facilitate automated identification of mismatches during both the design and runtime phases.

Within the scope of this project, a two-phase study was conducted. In Phase 1, practitioners were interviewed to uncover instances of mismatches and their associated consequences. Simultaneously, a review of existing literature was conducted to identify documented best practices for developing ML systems. In Phase 2, the validated mismatches identified through interviews were aligned with the system attributes outlined in the literature review. This alignment enabled the definition of attributes that would allow the detection of each specific mismatch. These attributes were then formalized into descriptors using JSON Schema.
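To make the idea of machine-readable descriptors more concrete, here is a minimal sketch of how two descriptors might be compared automatically. The descriptor contents, every field name, and the `find_mismatches` function are hypothetical illustrations; they are not the attributes or schemas defined in the study.

```python
import json

# Hypothetical descriptors; the field names are illustrative and are not
# the attributes defined in the study.
trained_model_descriptor = json.loads("""
{
  "name": "side_effect_classifier",
  "programming_language": "python",
  "required_memory_gb": 16,
  "input_fields": ["age", "dose", "tumor_stage"]
}
""")

operational_environment_descriptor = json.loads("""
{
  "name": "inference_server",
  "supported_languages": ["python", "java"],
  "available_memory_gb": 8,
  "provided_fields": ["age", "dose"]
}
""")


def find_mismatches(model, environment):
    """Compare the assumptions recorded in two descriptors and return a list
    of human-readable mismatch messages."""
    mismatches = []
    if model["programming_language"] not in environment["supported_languages"]:
        mismatches.append("programming language not supported")
    if model["required_memory_gb"] > environment["available_memory_gb"]:
        mismatches.append("insufficient memory")
    missing = sorted(set(model["input_fields"]) - set(environment["provided_fields"]))
    if missing:
        mismatches.append("missing input fields: " + ", ".join(missing))
    return mismatches


for message in find_mismatches(trained_model_descriptor,
                               operational_environment_descriptor):
    print(message)
```

Because both sides write their assumptions down in a structured form, a check like this can run at design time, before the model is ever deployed.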

### Phase 1 Results
Overall, the study identified seven categories of mismatches. The categories are as follows:

1. **trained model**: Mismatches in this category split evenly between information related to test cases and test data, and lack of information about the API and specifications.

2. **operational environment**: Many of the mismatches in this category were related to runtime metrics and data. The model was put into operation, and the operations staff did not know what they were supposed to monitor.

3. **task and purpose**: This category pertains to requirements: It focuses on the necessary communication between project owners and data scientists to ensure that the model developed by data scientists aligns with the expectations of the project owners.

4. **raw data**: Among the raw-data mismatches, many arose due to missing metadata. This includes essential details such as the data's collection methodology, time of collection, distribution approach, geographical origin, and the specific time frames during which it was gathered. Additionally, mismatches often involved incomplete descriptions of data elements, encompassing aspects like field names, explanations, values, and interpretations of any absent or null values.

5. **development environment**: The majority of these mismatches were related to programming languages. Mismatches frequently arose due to data scientists neglecting to communicate the programming language employed in developing the model, or software engineers not conveying information about the programming language utilized within the system itself.

6. **operational data**: Most mismatches here stemmed from lack of operational data statistics.

7. **training data**:

The largest share of identified mismatches stems from inaccurate assumptions regarding the trained model (36%), which is developed by data scientists and integrated by software engineers into larger systems. The second most prevalent category is the operational environment (16%), encompassing the computing setting where the trained model operates, known as the model-serving or production environment.

Subsequent categories include task and purpose (15%), defining the model's expected functions and limitations, and raw data (10%), representing the data on which the training data is based. In smaller proportions, mismatches relate to the development environment (9%), utilized by software engineers for integrating and testing the model; operational data (8%), the data processed by the model during its operation; and training data (6%), employed to train the model. Examples of the different mismatch categories can be found at the following <a href="https://insights.sei.cmu.edu/blog/software-engineering-for-machine-learning-characterizing-and-detecting-mismatch-in-machine-learning-systems/" target="_blank">link</a>.

### Conclusion
The findings from Phase 1 of the study indicate that enhancing communication and automating the awareness and detection of ML mismatches can contribute to the enhancement of software engineering for ML-enabled systems.

### Personal Examples:
I lack any industry experience in my background. In fact, my entire career has been within academic settings. As a result, I will attempt to align some of my academic experiences with the mentioned criteria and highlight the most relevant connections here. It's important to note that the specified mismatches can also arise within an academic environment, particularly among research group members collaborating on shared projects.

My experiences are:

**trained model**: 

Regarding this category, I'm currently facing a mismatch while working on my internship project. My task involves creating or improving a deep learning model to predict side effects of head and neck cancer. My supervisor suggested I begin by using a previously trained model, which was developed for similar tasks, on my dataset. The idea was to later improve upon it to achieve higher optimization and accuracy with my own model. 

However, as I started working with their model, I encountered an issue. I couldn't locate any documentation regarding the trained model, such as a readme file or additional explanations about its functioning. This lack of information made it challenging for me to comprehend how the model operates and integrate it effectively. Consequently, I reached out to my supervisor and asked for an explanation of the model's workings.

**operational environment**:

During my participation in the 'Integrated Omics Project' course, I encountered a noteworthy example of this mismatch category. In this project, I constructed a pipeline for processing raw images of a biological organism obtained with an X-ray microscope. This pipeline aimed to transform these images into a collection of endmembers, which could later be refined into colored images using techniques such as the FCLS method. However, these transformation techniques were time-consuming, so it was important to assess the quality of the image transformations beforehand using evaluation metrics.

With this goal in mind, I engaged with a colleague to make adjustments to the preliminary conditions of the image preprocessing stage within the pipeline. Regrettably, I omitted to explicitly communicate the requirement of employing evaluation metrics to gauge the resulting image quality. Consequently, my colleague proceeded under the assumption that running the pipeline and obtaining results were the primary objectives.

Upon gathering the results, we applied the FCLS method to generate colored images from the endmembers. The run across all endmember groups took a week, and when we finally examined the resulting images, it became apparent that the endmember groups did not have the required quality. I recognized that I had forgotten to tell my colleague about the essential use of evaluation metrics within the pipeline, which could have prevented this issue.

**task and purpose**:

I experienced a 'task and purpose' mismatch during the same 'Integrated Omics Project' course. In the initial months, our group was left somewhat uncertain about our supervisor's precise expectations. Our task was to implement various methods for extracting abundance maps using different techniques. However, the ultimate goal and desired outcome of the project remained unclear, leaving us without a definitive endpoint or conclusion in mind.

Asking our supervisor for clarification about the project's intended endpoint shed light on the matter. It turned out that our objective was to categorize all the abundance maps and create distinct groups based on similarities in energy channels. This categorization was intended to aid biologists in effectively comparing the abundance maps and identifying the most accurate and relevant set.

**raw data**:

I encountered this issue while delving into the Titanic dataset for one of the exercises in my educational machine learning notebook. While others had previously explored this dataset and shared its metadata on their websites, I chose to rely solely on the Kaggle metadata provided for the dataset, which was unfortunately incomplete. My decision to do so was deliberate, as I aimed to demonstrate that a substantial amount of information about the dataset could be uncovered using the dataset itself.

Although I managed to deduce the meaning of each feature, understand the measurement scales for these features, and decipher the implications of null values, this process took 2 to 3 days of my time. This experience underscored the significance of metadata for a data scientist: it can greatly streamline the research process and speed up understanding.
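Part of this kind of reverse engineering can be automated. The sketch below profiles each column of a CSV table, counting missing values and listing the distinct values that are present, as a starting point for writing the missing metadata. The rows are invented for this example and only mimic the shape of a passenger table; they are not taken from the Kaggle dataset, and `profile_columns` is a hypothetical helper.

```python
import csv
import io

# A few invented rows that only mimic the shape of a passenger table;
# they are not taken from the Kaggle Titanic dataset.
sample_csv = """Pclass,Age,Cabin
1,38,C85
3,,
3,26,
2,35,
1,,E46
"""


def profile_columns(csv_text):
    """Return, per column, the number of missing values and the distinct
    non-missing values, as a starting point for writing metadata."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    profile = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [value for value in values if value != ""]
        profile[column] = {
            "missing": len(values) - len(present),
            "distinct": sorted(set(present)),
        }
    return profile


for column, summary in profile_columns(sample_csv).items():
    print(column, summary)
```

A profile like this shows where values are absent, but it cannot say what an absent value means; that interpretation still has to come from whoever collected the data.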

**development environment**:

I've never encountered a mismatch arising from the use of different programming languages in a project. However, I did come across a situation during the 'Integrated Omics Project' course that falls into a similar mismatch category. In this particular course, my colleague and I each developed our own individual pipelines for the project. Yet, when we compared the results of these two pipelines, we noticed a discrepancy in the outcomes. The abundance maps extracted from the two pipelines using the same methods turned out to be different.

To address this, we spent a whole day investigating the issue. Eventually, we identified the source of the difference: the implementation of PCA (Principal Component Analysis). My colleague used scikit-learn's PCA class, while I built PCA from scratch. Upon reviewing the scikit-learn source, we discovered that its PCA is based on SVD (Singular Value Decomposition). In contrast, my approach involved a matrix decomposition into eigenvalues and eigenvectors. This difference in implementation led to the variation in our results.
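The sketch below shows one common way the two routes can disagree. For centered data, eigendecomposition of the covariance matrix and SVD of the data matrix recover the same variances and the same directions, but each direction may come out with the opposite sign. This is an assumption about the kind of difference involved; the account above does not specify it, and differences in centering or scaling are other possible causes. The data is randomly generated for illustration, and NumPy is used for both routes rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 correlated features.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
data = rng.normal(size=(200, 3)) @ mixing
centered = data - data.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
covariance = centered.T @ centered / (len(centered) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]      # largest variance first
eig_components = eigenvectors[:, order].T  # one component per row

# Route 2: singular value decomposition of the centered data.
_, singular_values, svd_components = np.linalg.svd(centered, full_matrices=False)
svd_variances = singular_values ** 2 / (len(centered) - 1)

# Both routes recover the same variances ...
print(np.allclose(eigenvalues[order], svd_variances))

# ... and the same directions, but possibly with opposite signs.
same_up_to_sign = np.allclose(np.abs(eig_components), np.abs(svd_components))
print(same_up_to_sign)

# Flipping each eigenvector to point the same way as its SVD counterpart
# makes the two sets of components directly comparable.
signs = np.sign(np.sum(eig_components * svd_components, axis=1, keepdims=True))
aligned = eig_components * signs
print(np.allclose(aligned, svd_components))
```

When two pipelines need to produce comparable outputs, agreeing on a sign convention (or on a single shared implementation) up front avoids this kind of investigation.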

**operational data**:

I have never experienced this one, since all the test data that I used in my courses were part of my main dataset, from which I randomly held out a portion for testing purposes. Moreover, I have just started my internship and have not designed my model yet, let alone deployed it in a real situation. Furthermore, during the 'Integrated Omics Project' course, the test data given to us had exactly the same shape and characteristics as the training data.

**training data**:

I have never had an experience like this. One reason may be that I do not have much experience in this field; another is that we have been taught to separate the preprocessing part from the model design part. Consequently, we always treat the preparation of the training dataset as a distinct step.

---

# Tackling Collaboration Challenges in the Development of ML-Enabled Systems 
Grace Lewis, February 27, 2023, <a href="https://insights.sei.cmu.edu/blog/tackling-collaboration-challenges-in-the-development-of-ml-enabled-systems/" target="_blank">link</a>.

### Introduction
In software projects involving multiple developers, collaboration is essential. Development tasks are divided into system components, with team members working independently until integration. Component interfaces play a key role in determining collaboration points. Challenges arise when communication is difficult or interdisciplinary collaboration is needed. Experience, backgrounds, and differing expectations can hinder traditional development projects. Strategies and informal tools aid collaboration. Software lifecycle models like waterfall, spiral, and Agile assist in planning interfaces.

ML-enabled systems combine traditional development with ML components. This requires coordination between data science and software engineering for model creation, interface negotiation, and system operation. Data science expertise is vital for effective model building, but software engineers attempting this without proper training often create ineffective models. Data scientists may overlook engineering aspects affecting their models. Recent focus in software engineering research has been on testing, deployment, fairness, and robustness of ML models, with limited exploration of system-wide perspectives for ML-enabled systems.

### The structure of the research
Given the scarcity of existing research on collaboration within ML-enabled system development, the authors employed a qualitative approach for their study, organized into four key phases: (1) defining scope and conducting a comprehensive literature review, (2) interviewing professionals engaged in building ML-enabled systems, (3) corroborating the interview insights with findings from the literature review, and (4) validating the conclusions through feedback from the interviewees.

During their analysis, they found that certain teams were tasked with both model and software development, while in other scenarios, distinct teams managed software and model development separately. Across all the teams they studied, they did not identify any overarching trends that applied universally. However, as they shifted their focus to specific collaboration aspects, patterns did become apparent:

1. Requirements and Planning
2. Training Data
3. Product-Model Integration

Within these specific areas, they observed consistent trends and dynamics emerging among the teams they examined.

### Requirements and planning
Teams identify product and model requirements in different orders:

1. **Model first**: These teams build the model first and then build the product around the model. The model shapes product requirements.

2. **Product first**: These teams start with product development and then develop a model to support it. Most often, the product already exists, and new ML development seeks to enhance the product’s capabilities. In this group, product requirements shape the model requirements.

3. **Parallel**: The model and product teams work in parallel.

Regardless of the type of trajectory a company is working with, they may face some tensions between product and model requirements.

1. Product requirements require input from the model team.

2. Model development with unclear requirements is common.

3. Provided model requirements rarely go beyond accuracy and data security. 

To tackle these tensions, the authors present four solutions:

1. Involve data scientists early in the process.
   
2. Consider adopting a parallel development trajectory for product and model teams.
   
3. Conduct ML training sessions to educate clients and product teams.
    
4. Adopt more formal requirements documentation for both model and product.